Hyperbolic Genome Embeddings

Authors: Raiyan Khan, Philippe Chlenski, Itsik Pe'er

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Across 37 out of 42 genome interpretation benchmark datasets, our hyperbolic models outperform their Euclidean equivalents. Notably, our approach even surpasses state-of-the-art performance on seven GUE benchmark datasets, consistently outperforming many DNA language models while using orders of magnitude fewer parameters and avoiding pretraining. Our results include a novel set of benchmark datasets, the Transposable Elements Benchmark, which explores a major but understudied component of the genome with deep evolutionary significance. We further motivate our work by exploring how our hyperbolic models recognize genomic signal under various data-generating conditions and by constructing an empirical method for interpreting the hyperbolicity of dataset embeddings. Throughout these assessments, we find persistent evidence highlighting the potential of our hyperbolic framework as a robust paradigm for genome representation learning.
Researcher Affiliation Academia Raiyan R. Khan, Philippe Chlenski, Itsik Pe'er Columbia University EMAIL
Pseudocode No The paper describes methods through mathematical equations and textual explanations, but does not contain explicitly labeled pseudocode or algorithm blocks with structured steps.
Open Source Code Yes Our code and benchmark datasets are available at https://github.com/rrkhan/HGE.
Open Datasets Yes Our code and benchmark datasets are available at https://github.com/rrkhan/HGE. For the retrotransposon and DNA transposon tasks, we crafted a dataset by employing annotations from PlantRep (Luo et al., 2022), a database that provides comprehensive annotations of plant repetitive elements across 459 plant genomes. The Oryza glumipatula genome (v1.5) was downloaded from the NCBI genome browser (https://ftp.ncbi.nlm.nih.gov).
Dataset Splits Yes We used a chromosome-level train/validation/test split for our sequences, separating out chromosomes 8/9 and 20-22/17-19 for the validation/test sets in Oryza glumipatula and human, respectively, while the remaining chromosomes were used for the training sets.
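The chromosome-level split described above can be sketched as follows. This is a minimal illustration, not the authors' code: `records` and `split_by_chromosome` are hypothetical names, and the chromosome labels follow the human assignment quoted above (20-22 for validation, 17-19 for test).

```python
# Hedged sketch of a chromosome-level train/validation/test split.
# Holding out whole chromosomes prevents near-duplicate sequences
# from leaking between splits.

VAL_CHROMS = {"chr20", "chr21", "chr22"}   # human validation chromosomes
TEST_CHROMS = {"chr17", "chr18", "chr19"}  # human test chromosomes

def split_by_chromosome(records):
    """Partition (chromosome, sequence) pairs into train/val/test."""
    splits = {"train": [], "val": [], "test": []}
    for chrom, seq in records:
        if chrom in VAL_CHROMS:
            splits["val"].append((chrom, seq))
        elif chrom in TEST_CHROMS:
            splits["test"].append((chrom, seq))
        else:
            splits["train"].append((chrom, seq))
    return splits
```

An analogous mapping (chromosomes 8 for validation, 9 for test) would apply for Oryza glumipatula.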
Hardware Specification No The paper does not provide specific hardware details (like exact GPU/CPU models or processor types) used for running its experiments.
Software Dependencies No The paper mentions using packages like the Environment for Tree Exploration (ETE) toolkit, PYVOLVE package, MANIFY package, and NETWORKX package, but does not provide specific version numbers for these software dependencies as required for replication.
Experiment Setup Yes Table 3: Hyperparameter settings for CNN/HCNN training, including the optimizer, learning rate (TEB/GUE/GB), manifold learning rate, batch size, weight decay, epochs, and β1/β2 values for the Euclidean CNN, HCNN-S, and HCNN-M models.
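A configuration capturing the fields listed in Table 3 might look like the skeleton below. The field names mirror the table's rows, but the values filled in are placeholders for illustration only, NOT the paper's actual settings; consult Table 3 of the paper for those.

```python
# Skeleton mirroring the hyperparameter fields of Table 3.
# All values below are illustrative placeholders.
from dataclasses import dataclass

@dataclass
class TrainingConfig:
    optimizer: str                  # optimizer name
    learning_rate: float            # per-benchmark (TEB/GUE/GB) in the paper
    manifold_learning_rate: float   # separate rate for manifold parameters
    batch_size: int
    weight_decay: float
    epochs: int
    beta1: float                    # Adam-style β1
    beta2: float                    # Adam-style β2

# Placeholder instance, e.g. for one model/benchmark combination.
cfg = TrainingConfig(
    optimizer="Adam",
    learning_rate=1e-3,
    manifold_learning_rate=1e-3,
    batch_size=32,
    weight_decay=0.0,
    epochs=10,
    beta1=0.9,
    beta2=0.999,
)
```

A separate manifold learning rate is typical when hyperbolic (manifold-valued) parameters are optimized alongside ordinary Euclidean weights.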