SPACE: Your Genomic Profile Predictor is a Powerful DNA Foundation Model

Authors: Zhao Yang, Jiwei Zhu, Bing Su

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through extensive experiments across various tasks, our model achieves state-of-the-art performance, establishing that DNA models trained with supervised genomic profiles serve as powerful DNA representation learners. We conducted rigorous benchmarking against the suite of 18 genomic datasets established in NT (Dalla-Torre et al., 2024), encompassing three fundamental task categories: (1) histone modification marker prediction, (2) cis-regulatory element annotation, and (3) splice site recognition.
Researcher Affiliation | Academia | 1 Gaoling School of Artificial Intelligence, Renmin University of China, Beijing, China; 2 Beijing Key Laboratory of Research on Large Models and Intelligent Governance; 3 Engineering Research Center of Next-Generation Intelligent Search and Recommendation, MOE. Correspondence to: Bing Su <EMAIL>.
Pseudocode | No | The paper describes the model architecture and methods primarily through textual descriptions and mathematical formulations (Equations 1-8), along with diagrams. It does not explicitly present pseudocode or algorithm blocks.
Open Source Code | Yes | The code is available at https://github.com/ZhuJiwei111/SPACE.
Open Datasets | Yes | The pre-training dataset aligns with that used in Enformer (Kelley, 2020; Avsec et al., 2021), containing DNA sequences and corresponding genomic profiles for the human and mouse genomes. The benchmark dataset comprises 18 downstream tasks originally proposed in Nucleotide Transformer (NT) (Dalla-Torre et al., 2024), accessible via https://huggingface.co/datasets/InstaDeepAI/nucleotide_transformer_downstream_tasks_revised. GUE is a comprehensive benchmark for genome understanding consisting of 28 distinct datasets across 7 tasks and 4 species, downloaded from https://github.com/MAGICS-LAB/DNABERT_2. Genomic Benchmarks currently comprises nine datasets focusing on regulatory elements ... All data were downloaded from https://github.com/ML-Bioinfo-CEITEC/genomic_benchmarks.
Dataset Splits | Yes | The pre-training dataset aligns with that used in Enformer (Kelley, 2020; Avsec et al., 2021), containing DNA sequences and corresponding genomic profiles for the human and mouse genomes. Human: 34,021 train / 2,213 validation / 1,937 test sequences of 131,072 bp each; Mouse: 29,295 train / 2,209 validation / 2,017 test sequences of 131,072 bp each. In alignment with NT's methodology, we implemented 10-fold cross-validation with fixed random seeds (0-9) and early stopping based on validation performance. All evaluations strictly adhered to benchmark specifications, including standardized train-test splits and hyperparameter configurations, to maintain reproducibility and fairness.
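The seeded evaluation protocol quoted above can be sketched as follows. This is a minimal illustration, not the benchmark's actual code: the helper name `make_split`, the validation fraction, and the reading of "fixed random seeds (0-9)" as ten independently seeded shuffles are all assumptions for illustration.

```python
import numpy as np

def make_split(n_samples: int, seed: int, val_frac: float = 0.1):
    """Shuffle indices with a fixed seed and carve out a validation set.

    Hypothetical helper mirroring the protocol of repeated runs with
    seeds 0-9; `val_frac` is an illustrative choice, not from the paper.
    """
    rng = np.random.default_rng(seed)          # fixed seed => reproducible split
    idx = rng.permutation(n_samples)
    n_val = int(n_samples * val_frac)
    return idx[n_val:], idx[:n_val]            # (train indices, validation indices)

# One run per fixed seed, as in the 10-fold protocol (seeds 0-9); each run
# would train with early stopping on its validation split.
splits = [make_split(1000, seed) for seed in range(10)]
```

Fixing the seeds rather than drawing them at random is what makes the reported cross-validation results exactly reproducible across reruns.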
Hardware Specification | Yes | For cross-species joint modeling, we implemented an alternating training strategy using eight NVIDIA A40 GPUs.
Software Dependencies | No | Optimization employed AdamW (Loshchilov & Hutter, 2019) with an initial learning rate of 5×10⁻⁴, linearly ramped from 0 during the first 5,000 steps followed by cosine decay. The training protocol utilized the AdamW optimizer (Loshchilov & Hutter, 2019) over 3 epochs, while retaining default parameter settings from the Hugging Face Transformers Trainer implementation (Wolf et al., 2020). The paper mentions software components such as AdamW and the Hugging Face Transformers Trainer but does not specify their version numbers.
Experiment Setup | Yes | Training proceeded for 50,000 steps (approximately 8 days) with a global batch size of 64, achieved through 8 gradient accumulation steps (1 sample per GPU). Optimization employed AdamW (Loshchilov & Hutter, 2019) with an initial learning rate of 5×10⁻⁴, linearly ramped from 0 during the first 5,000 steps followed by cosine decay. Gradient norms were clipped at 0.2 to maintain stability. Our systematic hyperparameter search included learning rates of 5×10⁻⁵, 3×10⁻⁵, and 5×10⁻⁴, combined with batch sizes of 8, 16, and 32. Through empirical validation, we identified the optimal configuration employing a learning rate of 5×10⁻⁵ with batch size 8. The training protocol utilized the AdamW optimizer (Loshchilov & Hutter, 2019) over 3 epochs.
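The pre-training learning-rate schedule quoted above (linear warmup from 0 over 5,000 steps to 5×10⁻⁴, then cosine decay over the 50,000-step run) can be written as a pure function. This is a sketch under one assumption the paper does not state: that the cosine decay ends at a learning rate of 0.

```python
import math

PEAK_LR = 5e-4        # initial (peak) learning rate from the paper
WARMUP_STEPS = 5_000  # linear ramp from 0
TOTAL_STEPS = 50_000  # pre-training length

def learning_rate(step: int) -> float:
    """Linear warmup to PEAK_LR, then cosine decay.

    Decaying all the way to 0 at TOTAL_STEPS is an assumption; the paper
    only says "cosine decay" without giving the final learning rate.
    """
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return 0.5 * PEAK_LR * (1.0 + math.cos(math.pi * progress))
```

In a training loop this value would be assigned to the optimizer's parameter groups each step, alongside gradient-norm clipping at 0.2 (e.g. `torch.nn.utils.clip_grad_norm_(model.parameters(), 0.2)` in PyTorch).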