reproducibilityindex.ai

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

The Harmonic Indel Distance

Authors: Bob Pepin

TMLR 2024

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Then we perform an experimental comparison of HID to normalized and unnormalized versions of the indel distance on benchmark tasks for biomedical sequence data. We finally show planar embeddings of the benchmark datasets to provide some insights into the geometry of the metric spaces associated with the different distance metrics. Section 4, Experiments: The purpose of the experiments in this section is to compare the HID to other string distances when applied to machine learning tasks
Researcher Affiliation	Academia	Bob Pepin (EMAIL), Department of Computer Science, University of Copenhagen
Pseudocode	No	The paper defines the harmonic indel distance using a mathematical formula and then proceeds with proofs and experimental comparisons. It does not contain structured pseudocode or algorithm blocks.
Open Source Code	No	The paper mentions using 'existing libraries for computing LCS, available in all major programming languages' and 'the SVM implementation from Scikit-Learn (Pedregosa et al., 2011)'. However, it does not explicitly state that the authors are releasing their own source code for the methodology or experiments described in this paper.
Open Datasets	Yes	The classification task involves the classification of sequences of non-coding RNA according to their type and uses the Dataset2 dataset from the benchmark paper Creux et al. (2024). We evaluate the regression performance on the thermostability prediction task from the FLIP benchmark for protein sequences (Dallago et al., 2021).
Dataset Splits	Yes	The dataset provides training and test splits, and a validation set was generated by random splitting of the provided training set. This is a challenging benchmark which includes a carefully selected train-validation-test split based on biological considerations. Table 1: Dataset statistics (number of sequences in training / validation / test; sequence length min. / max. / median). nc RNA: 25371 / 6430 / 13646; 42 / 500 / 123. FLIP Mixed: 22335 / 2482 / 3134; 20 / 35213 / 413. FLIP Human: 7287 / 861 / 1945; 39 / 34350 / 477.0. FLIP Human-Cell: 5149 / 643 / 1366; 44 / 34350 / 469.0.
Hardware Specification	No	The paper mentions software used for experiments (Optuna, Scikit-Learn) but does not provide specific hardware details such as CPU/GPU models, memory, or other computer specifications used for running the experiments.
Software Dependencies	No	The SVM margin as well as the RBF variance hyperparameters were optimized using the Tree-structured Parzen Estimator algorithm implemented in the Optuna software (Akiba et al., 2019). We used the SVM implementation from Scikit-Learn (Pedregosa et al., 2011). No specific version numbers are provided for these software dependencies.
Experiment Setup	Yes	All benchmark experiments used support vector machines with radial basis function kernels based on the string metrics described above. The SVM margin as well as the RBF variance hyperparameters were optimized using the Tree-structured Parzen Estimator algorithm implemented in the Optuna software (Akiba et al., 2019). Table 4: Hyperparameters (dataset, distance metric: γ, C). nc RNA, HID: γ = 8.443116755229333, C = 99.5612463103948; nc RNA, STID: γ = 9.91793899262693, C = 1.485844877379199; nc RNA, ID: γ = 0.00012069602683643651, C = 0.3910622178775264; FLIP Mixed, HID: γ = 2.694680159717171, C = 5.303192435782873; FLIP Mixed, STID: γ = 3.4503356492866195, C = 1.696222220623828; FLIP Mixed, ID: γ = 0.00014856111427324835, C = 2.018895367718889; FLIP Human, HID: γ = 3.0711811333241985, C = 8.972094031636633; FLIP Human, STID: γ = 2.2338162722360013, C = 2.472937362472116; FLIP Human, ID: γ = 0.00010361128449343376, C = 0.6548096783807119; FLIP Human-Cell, HID: γ = 2.211198503980429, C = 5.994311048805454; FLIP Human-Cell, STID: γ = 4.758786457584244, C = 29.469372386529603; FLIP Human-Cell, ID: γ = 0.00010519254700762129, C = 0.02074221663631553. All t-SNE embeddings used a target perplexity of 30 and were run for the number of iterations given in Table 5.
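
The Experiment Setup row describes SVMs with RBF kernels built on string metrics, and the Open Source Code row notes that the authors do not state that they release code. As a rough illustration only, the sketch below computes the unnormalized indel distance (ID) through the longest common subsequence, using the standard identity ID(a, b) = |a| + |b| - 2 * LCS(a, b), and turns pairwise distances into an RBF-style Gram matrix. The function names and the kernel form exp(-gamma * d^2) are assumptions made for this example and are not taken from the paper. The HID itself is not reproduced, since its formula does not appear on this page.

```python
import math


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence, O(len(a) * len(b)) time."""
    if len(a) < len(b):
        a, b = b, a  # iterate over the longer string so the DP row stays short
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def indel_distance(a: str, b: str) -> int:
    """Unnormalized indel distance: insertions plus deletions turning a into b."""
    return len(a) + len(b) - 2 * lcs_length(a, b)


def rbf_gram_matrix(seqs, gamma):
    """Gram matrix K[i][j] = exp(-gamma * d(s_i, s_j) ** 2).

    ASSUMPTION: this kernel form is illustrative; the page does not state
    the exact RBF parameterization used in the paper.
    """
    n = len(seqs)
    gram = [[1.0] * n for _ in range(n)]  # d(s, s) = 0, so the diagonal is 1
    for i in range(n):
        for j in range(i + 1, n):
            d = indel_distance(seqs[i], seqs[j])
            gram[i][j] = gram[j][i] = math.exp(-gamma * d * d)
    return gram
```

A matrix of this shape could be passed to Scikit-Learn's `SVC(kernel="precomputed")`, with `gamma` and `C` tuned by an Optuna study, but that wiring is left out here because the page gives no further detail on it.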