Hyperbolic Genome Embeddings

Authors: Raiyan Khan, Philippe Chlenski, Itsik Pe'er

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Across 37 out of 42 genome interpretation benchmark datasets, our hyperbolic models outperform their Euclidean equivalents. Notably, our approach even surpasses state-of-the-art performance on seven GUE benchmark datasets, consistently outperforming many DNA language models while using orders of magnitude fewer parameters and avoiding pretraining. Our results include a novel set of benchmark datasets, the Transposable Elements Benchmark, which explores a major but understudied component of the genome with deep evolutionary significance. We further motivate our work by exploring how our hyperbolic models recognize genomic signal under various data-generating conditions and by constructing an empirical method for interpreting the hyperbolicity of dataset embeddings. Throughout these assessments, we find persistent evidence highlighting the potential of our hyperbolic framework as a robust paradigm for genome representation learning.
Researcher Affiliation Academia Raiyan R. Khan, Philippe Chlenski, Itsik Pe'er Columbia University EMAIL
Pseudocode No The paper describes methods through mathematical equations and textual explanations, but does not contain explicitly labeled pseudocode or algorithm blocks with structured steps.
Open Source Code Yes Our code and benchmark datasets are available at https://github.com/rrkhan/HGE.
Open Datasets Yes Our code and benchmark datasets are available at https://github.com/rrkhan/HGE. For the retrotransposon and DNA transposon tasks, we crafted a dataset by employing annotations from PlantRep (Luo et al., 2022), a database that provides comprehensive annotations of plant repetitive elements across 459 plant genomes. The Oryza glumipatula genome (v1.5) was downloaded from the NCBI genome browser (https://ftp.ncbi.nlm.nih.gov).
Dataset Splits Yes We used a chromosome-level train/validation/test split for our sequences, separating out chromosomes 8/9 and 20-22/17-19 for the validation/test sets in Oryza glumipatula and human, respectively, while the remaining chromosomes were used for the training sets.
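The chromosome-level split described above can be sketched as follows. This is a minimal illustration, not the authors' code: `records` and `split_by_chromosome` are hypothetical names, and the chromosome labels follow the human assignment quoted above (20-22 for validation, 17-19 for test).

```python
# Hedged sketch of a chromosome-level train/validation/test split.
# Holding out whole chromosomes prevents near-duplicate sequences
# from leaking between splits.

VAL_CHROMS = {"chr20", "chr21", "chr22"}   # human validation chromosomes
TEST_CHROMS = {"chr17", "chr18", "chr19"}  # human test chromosomes

def split_by_chromosome(records):
    """Partition (chromosome, sequence) pairs into train/val/test."""
    splits = {"train": [], "val": [], "test": []}
    for chrom, seq in records:
        if chrom in VAL_CHROMS:
            splits["val"].append((chrom, seq))
        elif chrom in TEST_CHROMS:
            splits["test"].append((chrom, seq))
        else:
            splits["train"].append((chrom, seq))
    return splits
```

An analogous mapping (chromosomes 8 for validation, 9 for test) would apply for Oryza glumipatula.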
Hardware Specification No The paper does not provide specific hardware details (like exact GPU/CPU models or processor types) used for running its experiments.
Software Dependencies No The paper mentions using packages like the Environment for Tree Exploration (ETE) toolkit, PYVOLVE package, MANIFY package, and NETWORKX package, but does not provide specific version numbers for these software dependencies as required for replication.
Experiment Setup Yes Table 3: Hyperparameter settings for CNN/HCNN training, including the optimizer, learning rate (TEB/GUE/GB), manifold learning rate, batch size, weight decay, epochs, and β1/β2 values for the Euclidean CNN, HCNN-S, and HCNN-M models.
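A configuration capturing the fields listed in Table 3 might look like the skeleton below. The field names mirror the table's rows, but the values filled in are placeholders for illustration only, NOT the paper's actual settings; consult Table 3 of the paper for those.

```python
# Skeleton mirroring the hyperparameter fields of Table 3.
# All values below are illustrative placeholders.
from dataclasses import dataclass

@dataclass
class TrainingConfig:
    optimizer: str                  # optimizer name
    learning_rate: float            # per-benchmark (TEB/GUE/GB) in the paper
    manifold_learning_rate: float   # separate rate for manifold parameters
    batch_size: int
    weight_decay: float
    epochs: int
    beta1: float                    # Adam-style β1
    beta2: float                    # Adam-style β2

# Placeholder instance, e.g. for one model/benchmark combination.
cfg = TrainingConfig(
    optimizer="Adam",
    learning_rate=1e-3,
    manifold_learning_rate=1e-3,
    batch_size=32,
    weight_decay=0.0,
    epochs=10,
    beta1=0.9,
    beta2=0.999,
)
```

A separate manifold learning rate is typical when hyperbolic (manifold-valued) parameters are optimized alongside ordinary Euclidean weights.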