The Harmonic Indel Distance

Authors: Bob Pepin

TMLR 2024

Reproducibility Variable | Result | LLM Response

Research Type | Experimental | Then we perform an experimental comparison of HID to normalized and unnormalized versions of the indel distance on benchmark tasks for biomedical sequence data. We finally show planar embeddings of the benchmark datasets to provide some insights into the geometry of the metric spaces associated with the different distance metrics. Section 4, Experiments: The purpose of the experiments in this section is to compare the HID to other string distances when applied to machine learning tasks

Researcher Affiliation | Academia | Bob Pepin, EMAIL, Department of Computer Science, University of Copenhagen

Pseudocode | No | The paper defines the harmonic indel distance using a mathematical formula and then proceeds with proofs and experimental comparisons. It does not contain structured pseudocode or algorithm blocks.

Open Source Code | No | The paper mentions using 'existing libraries for computing LCS, available in all major programming languages' and 'the SVM implementation from Scikit-Learn (Pedregosa et al., 2011)'. However, it does not explicitly state that the authors are releasing their own source code for the methodology or experiments described in this paper.

Open Datasets | Yes | The classification task involves the classification of sequences of non-coding RNA according to their type and uses the Dataset2 dataset from the benchmark paper Creux et al. (2024). We evaluate the regression performance on the thermostability prediction task from the FLIP benchmark for protein sequences (Dallago et al., 2021).

Dataset Splits | Yes | The dataset provides training and test splits, and a validation set was generated by random splitting of the provided training set. This is a challenging benchmark which includes a carefully selected train-validation-test split based on biological considerations.

Table 1: Dataset statistics (Training, Validation, and Test give the number of sequences; Min., Max., and Median give the sequence length)

Dataset | Training | Validation | Test | Min. | Max. | Median
ncRNA | 25371 | 6430 | 13646 | 42 | 500 | 123
FLIP Mixed | 22335 | 2482 | 3134 | 20 | 35213 | 413
FLIP Human | 7287 | 861 | 1945 | 39 | 34350 | 477.0
FLIP Human-Cell | 5149 | 643 | 1366 | 44 | 34350 | 469.0

Hardware Specification | No | The paper mentions software used for experiments (Optuna, Scikit-Learn) but does not provide specific hardware details such as CPU/GPU models, memory, or other computer specifications used for running the experiments.

Software Dependencies | No | The SVM margin as well as the RBF variance hyperparameters were optimized using the Tree-structured Parzen Estimator algorithm implemented in the Optuna software (Akiba et al., 2019). We used the SVM implementation from Scikit-Learn (Pedregosa et al., 2011). No specific version numbers are provided for these software dependencies.

Experiment Setup | Yes | All benchmark experiments used support vector machines with radial basis function kernels based on the string metrics described above. The SVM margin as well as the RBF variance hyperparameters were optimized using the Tree-structured Parzen Estimator algorithm implemented in the Optuna software (Akiba et al., 2019). All t-SNE embeddings used a target perplexity of 30 and were run for the number of iterations given in Table 5.

Table 4: Hyperparameters

Dataset | Distance Metric | γ | C
ncRNA | HID | 8.443116755229333 | 99.5612463103948
ncRNA | STID | 9.91793899262693 | 1.485844877379199
ncRNA | ID | 0.00012069602683643651 | 0.3910622178775264
FLIP Mixed | HID | 2.694680159717171 | 5.303192435782873
FLIP Mixed | STID | 3.4503356492866195 | 1.696222220623828
FLIP Mixed | ID | 0.00014856111427324835 | 2.018895367718889
FLIP Human | HID | 3.0711811333241985 | 8.972094031636633
FLIP Human | STID | 2.2338162722360013 | 2.472937362472116
FLIP Human | ID | 0.00010361128449343376 | 0.6548096783807119
FLIP Human-Cell | HID | 2.211198503980429 | 5.994311048805454
FLIP Human-Cell | STID | 4.758786457584244 | 29.469372386529603
FLIP Human-Cell | ID | 0.00010519254700762129 | 0.02074221663631553
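
The experiment setup above builds RBF kernels on top of string distances that are computed from the longest common subsequence (LCS). Since the paper provides no pseudocode, the following is only a minimal sketch of that pipeline for the unnormalized indel distance (ID), using the standard identity ID(a, b) = |a| + |b| - 2·LCS(a, b). The function names are hypothetical, and the kernel form exp(-γ·d²) is an assumption: the summary only says that an RBF kernel with a hyperparameter γ was used, not how the distance enters it. The harmonic indel distance itself is not reproduced here because its formula does not appear in this summary.

```python
import math


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence, via row-wise dynamic programming."""
    if len(a) < len(b):
        a, b = b, a  # iterate over the longer string so the DP row stays short
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def indel_distance(a: str, b: str) -> int:
    """Unnormalized indel distance: insertions plus deletions needed to turn a into b."""
    return len(a) + len(b) - 2 * lcs_length(a, b)


def rbf_kernel_matrix(seqs, gamma, distance=indel_distance):
    """Kernel matrix K[i][j] = exp(-gamma * d(i, j)^2).

    ASSUMPTION: this squared-distance form is illustrative; the exact
    kernel parameterization used in the paper is not stated in the summary.
    """
    n = len(seqs)
    kernel = [[1.0] * n for _ in range(n)]  # d(x, x) = 0, so the diagonal is 1
    for i in range(n):
        for j in range(i + 1, n):
            d = distance(seqs[i], seqs[j])
            kernel[i][j] = kernel[j][i] = math.exp(-gamma * d * d)
    return kernel
```

A matrix built this way could be passed to Scikit-Learn's `SVC` with `kernel="precomputed"`, with γ and C left to a hyperparameter search such as the one described above. Any other distance function with the same two-string signature can be swapped in through the `distance` argument.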