Biological Sequence Kernels with Guaranteed Flexibility
Authors: Alan N. Amin, Debora S. Marks, Eli N. Weinstein
JMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We illustrate our theoretical results in simulation and on real biological data sets. ... In this section we examine the performance, on real biological sequence datasets, of some of our proposed kernels with discrete masses. We compare each to an existing kernel that relies on a similar notion of sequence similarity but lacks discrete masses. |
| Researcher Affiliation | Academia | Alan N. Amin EMAIL New York University Debora S. Marks EMAIL Harvard Medical School Broad Institute of Harvard and MIT Eli N. Weinstein EMAIL Technical University of Denmark |
| Pseudocode | Yes | Appendix I: Computation of thick-tailed alignment kernels. We can now compute $M$, $I_X$ and $I_Y$ with dynamic programming. We initialize at $M(i,0,l) = M(0,j,l) = 0$ for all $l, i, j$; $M(0,0,0) = 1$; and $I_X(0,j,l) = I_Y(i,0,l) = 0$ for all $l, i, j$. The update equations are $M(i,j,l) = \mathbf{1}(l>0)\,\ell(X(i-1),Y(j-1))\,k_s(X(i-1),Y(j-1))\,J(i-1,j-1,l-1) + (1-\ell(X(i-1),Y(j-1)))\,k_s(X(i-1),Y(j-1))\,J(i-1,j-1,l)$; $I_X(i,j,l) = e^{-\mu}\mu\,M(i-1,j,l) + e^{-\mu}I_X(i-1,j,l)$; $I_Y(i,j,l) = e^{-\mu}\mu\,M(i,j-1,l) + e^{-\mu}\mu\,I_X(i,j-1,l) + e^{-\mu}I_Y(i,j-1,l)$. |
| Open Source Code | Yes | Code can be found at https://github.com/AlanNawzadAmin/Kernels-with-guarantees/. |
| Open Datasets | Yes | The data consists of pairs of DNA sequences and transcription factor binding strengths, measured in terms of the intensity of a fluorescent signal in a micro-array assay (Barrera et al., 2016). ... For each patient, we have a data set of TCR CDR3 sequences, which vary in length from 10 to 19 amino acids (10x Genomics, 2022). ... We use N = 100 human TCR CDR3 sequences, with lengths varying between 10 and 17, as the target set (10x Genomics, 2022). |
| Dataset Splits | No | The paper mentions sub-sampling data for evaluation (e.g., "25 random sub-samples of a large data set" in Section 10.2, and "N is the size of the sub-sampled data set" in Figure 2), but does not specify explicit training/test/validation dataset splits with percentages, absolute sample counts, or references to predefined splits for model development and evaluation. |
| Hardware Specification | No | The paper does not explicitly describe the hardware used for running its experiments. While it mentions the use of GPUs in the context of efficiency of alignment kernels in related work ("the algorithm can be made much more efficient in practice with modest assumptions and the use of GPUs (Rush, 2020)"), it does not specify the models or configurations of any GPUs, CPUs, or other computing resources used by the authors for their empirical results. |
| Software Dependencies | No | The paper provides a link to its code repository (https://github.com/AlanNawzadAmin/Kernels-with-guarantees/), implying the existence of software, but it does not explicitly list any software dependencies with specific version numbers within the text. |
| Experiment Setup | Yes | We then optimize MMD($\delta_X$, $p_Y$) by taking the best substitution, insertion, or deletion of a single amino acid at each step, for 100 steps. ... For the scaled embedding, $\tilde{F}(X) = (20(1+\epsilon))^{\lvert X\rvert}/64 \cdot F(X)$, with $\epsilon = 0.1$. |
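The dynamic program quoted in the Pseudocode row can be sketched directly. The following is a minimal illustration, not the authors' implementation: the function and argument names are hypothetical, and it assumes the combined quantity $J(i,j,l) = M(i,j,l) + I_X(i,j,l) + I_Y(i,j,l)$, a substitution kernel `ks`, a per-pair weight `ell` (written ℓ in the paper), and a gap rate `mu`.

```python
import numpy as np

def alignment_kernel_dp(X, Y, ks, ell, mu, L):
    """Fill the M, IX, IY tables of the alignment-kernel recursion.

    Sketch only: assumes J(i, j, l) = M(i, j, l) + IX(i, j, l) + IY(i, j, l);
    ks(a, b) is a substitution kernel, ell(a, b) a weight in [0, 1],
    mu a gap rate, and L the largest value of the index l tracked.
    """
    nX, nY = len(X), len(Y)
    M = np.zeros((nX + 1, nY + 1, L + 1))
    IX = np.zeros((nX + 1, nY + 1, L + 1))
    IY = np.zeros((nX + 1, nY + 1, L + 1))
    M[0, 0, 0] = 1.0  # boundary condition; all other boundaries stay 0

    def J(i, j, l):
        return M[i, j, l] + IX[i, j, l] + IY[i, j, l]

    for i in range(nX + 1):
        for j in range(nY + 1):
            for l in range(L + 1):
                if i > 0 and j > 0:  # match/mismatch state
                    w = ell(X[i - 1], Y[j - 1])
                    k = ks(X[i - 1], Y[j - 1])
                    heavy = w * k * J(i - 1, j - 1, l - 1) if l > 0 else 0.0
                    M[i, j, l] = heavy + (1 - w) * k * J(i - 1, j - 1, l)
                if i > 0:  # insertion in X
                    IX[i, j, l] = np.exp(-mu) * (mu * M[i - 1, j, l] + IX[i - 1, j, l])
                if j > 0:  # insertion in Y
                    IY[i, j, l] = np.exp(-mu) * (
                        mu * M[i, j - 1, l] + mu * IX[i, j - 1, l] + IY[i, j - 1, l]
                    )
    return M, IX, IY
```

Under these assumptions, the kernel value would be read off from $J$ at the corner $(n_X, n_Y)$ of the tables.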
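The greedy sequence optimization described in the Experiment Setup row (take the best single-residue substitution, insertion, or deletion at each step) can be sketched as follows; `mmd` is a stand-in for the MMD objective, and all names here are hypothetical:

```python
def greedy_descent(x0, alphabet, mmd, n_steps=100):
    """Greedily edit a sequence to reduce mmd(x): at each step, enumerate
    every single-residue substitution, insertion, and deletion, and keep
    the candidate with the lowest score (staying put if none improves).
    mmd is any callable scoring a sequence; alphabet is the residue set."""
    x = list(x0)
    for _ in range(n_steps):
        best, best_val = x, mmd(x)
        candidates = []
        for i in range(len(x)):
            candidates.append(x[:i] + x[i + 1:])            # deletion at i
            for a in alphabet:
                candidates.append(x[:i] + [a] + x[i + 1:])  # substitution at i
        for i in range(len(x) + 1):
            for a in alphabet:
                candidates.append(x[:i] + [a] + x[i:])      # insertion before i
        for cand in candidates:
            v = mmd(cand)
            if v < best_val:
                best, best_val = cand, v
        x = best
    return x
```

With a 20-letter amino-acid alphabet, each step scores on the order of $40\lvert x\rvert$ candidates, so 100 steps remain cheap for CDR3-length sequences.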