Biological Sequence Kernels with Guaranteed Flexibility
Authors: Alan N. Amin, Debora S. Marks, Eli N. Weinstein
JMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We illustrate our theoretical results in simulation and on real biological data sets. ... In this section we examine the performance, on real biological sequence datasets, of some of our proposed kernels with discrete masses. We compare each to an existing kernel that relies on a similar notion of sequence similarity but lacks discrete masses. |
| Researcher Affiliation | Academia | Alan N. Amin EMAIL New York University Debora S. Marks EMAIL Harvard Medical School Broad Institute of Harvard and MIT Eli N. Weinstein EMAIL Technical University of Denmark |
| Pseudocode | Yes | Appendix I: Computation of thick-tailed alignment kernels. We can now compute $M$, $I_X$ and $I_Y$ with dynamic programming. We initialize at $M(i,0,l) = M(0,j,l) = 0$ for all $l, i, j$; $M(0,0,0) = 1$; and $I_X(0,j,l) = I_Y(i,0,l) = 0$ for all $l, i, j$. The update equations are $M(i,j,l) = \mathbf{1}(l>0)\,\ell(X(i-1),Y(j-1))\,k_s(X(i-1),Y(j-1))\,J(i-1,j-1,l-1) + (1-\ell(X(i-1),Y(j-1)))\,k_s(X(i-1),Y(j-1))\,J(i-1,j-1,l)$; $I_X(i,j,l) = e^{-\mu}\mu\,M(i-1,j,l) + e^{-\mu}I_X(i-1,j,l)$; $I_Y(i,j,l) = e^{-\mu}\mu\,M(i,j-1,l) + e^{-\mu}\mu\,I_X(i,j-1,l) + e^{-\mu}I_Y(i,j-1,l)$. |
| Open Source Code | Yes | Code can be found at https://github.com/AlanNawzadAmin/Kernels-with-guarantees/. |
| Open Datasets | Yes | The data consists of pairs of DNA sequences and transcription factor binding strengths, measured in terms of the intensity of a fluorescent signal in a micro-array assay (Barrera et al., 2016). ... For each patient, we have a data set of TCR CDR3 sequences, which vary in length from 10 to 19 amino acids (10x Genomics, 2022). ... We use N = 100 human TCR CDR3 sequences, with lengths varying between 10 and 17, as the target set (10x Genomics, 2022). |
| Dataset Splits | No | The paper mentions sub-sampling data for evaluation (e.g., "25 random sub-samples of a large data set" in Section 10.2, and "N is the size of the sub-sampled data set" in Figure 2), but does not specify explicit training/test/validation dataset splits with percentages, absolute sample counts, or references to predefined splits for model development and evaluation. |
| Hardware Specification | No | The paper does not explicitly describe the hardware used for running its experiments. While it mentions the use of GPUs in the context of efficiency of alignment kernels in related work ("the algorithm can be made much more efficient in practice with modest assumptions and the use of GPUs (Rush, 2020)"), it does not specify the models or configurations of any GPUs, CPUs, or other computing resources used by the authors for their empirical results. |
| Software Dependencies | No | The paper provides a link to its code repository (https://github.com/AlanNawzadAmin/Kernels-with-guarantees/), implying the existence of software, but it does not explicitly list any software dependencies with specific version numbers within the text. |
| Experiment Setup | Yes | We then optimize MMD($\delta_X$, $p_Y$) by taking the best substitution, insertion, or deletion of a single amino acid at each step, for 100 steps. ... For the scaled embedding, $\tilde{F}(X) = (20(1+\epsilon))^{\lvert X\rvert}/64 \cdot F(X)$, with $\epsilon = 0.1$. |
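The dynamic program quoted in the Pseudocode row can be sketched directly. The following is a minimal illustration, not the authors' implementation: the function and argument names are hypothetical, and it assumes the combined quantity $J(i,j,l) = M(i,j,l) + I_X(i,j,l) + I_Y(i,j,l)$, a substitution kernel `ks`, a per-pair weight `ell` (written ℓ in the paper), and a gap rate `mu`.

```python
import numpy as np

def alignment_kernel_dp(X, Y, ks, ell, mu, L):
    """Fill the M, IX, IY tables of the alignment-kernel recursion.

    Sketch only: assumes J(i, j, l) = M(i, j, l) + IX(i, j, l) + IY(i, j, l);
    ks(a, b) is a substitution kernel, ell(a, b) a weight in [0, 1],
    mu a gap rate, and L the largest value of the index l tracked.
    """
    nX, nY = len(X), len(Y)
    M = np.zeros((nX + 1, nY + 1, L + 1))
    IX = np.zeros((nX + 1, nY + 1, L + 1))
    IY = np.zeros((nX + 1, nY + 1, L + 1))
    M[0, 0, 0] = 1.0  # boundary condition; all other boundaries stay 0

    def J(i, j, l):
        return M[i, j, l] + IX[i, j, l] + IY[i, j, l]

    for i in range(nX + 1):
        for j in range(nY + 1):
            for l in range(L + 1):
                if i > 0 and j > 0:  # match/mismatch state
                    w = ell(X[i - 1], Y[j - 1])
                    k = ks(X[i - 1], Y[j - 1])
                    heavy = w * k * J(i - 1, j - 1, l - 1) if l > 0 else 0.0
                    M[i, j, l] = heavy + (1 - w) * k * J(i - 1, j - 1, l)
                if i > 0:  # insertion in X
                    IX[i, j, l] = np.exp(-mu) * (mu * M[i - 1, j, l] + IX[i - 1, j, l])
                if j > 0:  # insertion in Y
                    IY[i, j, l] = np.exp(-mu) * (
                        mu * M[i, j - 1, l] + mu * IX[i, j - 1, l] + IY[i, j - 1, l]
                    )
    return M, IX, IY
```

Under these assumptions, the kernel value would be read off from $J$ at the corner $(n_X, n_Y)$ of the tables.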
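The greedy sequence optimization described in the Experiment Setup row (take the best single-residue substitution, insertion, or deletion at each step) can be sketched as follows; `mmd` is a stand-in for the MMD objective, and all names here are hypothetical:

```python
def greedy_descent(x0, alphabet, mmd, n_steps=100):
    """Greedily edit a sequence to reduce mmd(x): at each step, enumerate
    every single-residue substitution, insertion, and deletion, and keep
    the candidate with the lowest score (staying put if none improves).
    mmd is any callable scoring a sequence; alphabet is the residue set."""
    x = list(x0)
    for _ in range(n_steps):
        best, best_val = x, mmd(x)
        candidates = []
        for i in range(len(x)):
            candidates.append(x[:i] + x[i + 1:])            # deletion at i
            for a in alphabet:
                candidates.append(x[:i] + [a] + x[i + 1:])  # substitution at i
        for i in range(len(x) + 1):
            for a in alphabet:
                candidates.append(x[:i] + [a] + x[i:])      # insertion before i
        for cand in candidates:
            v = mmd(cand)
            if v < best_val:
                best, best_val = cand, v
        x = best
    return x
```

With a 20-letter amino-acid alphabet, each step scores on the order of $40\lvert x\rvert$ candidates, so 100 steps remain cheap for CDR3-length sequences.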