Fast and Low-Cost Genomic Foundation Models via Outlier Removal
Authors: Haozheng Luo, Chenghao Qiu, Maojiang Su, Zhihan Zhou, Zoe Mehta, Guo Ye, Jerry Yao-Chieh Hu, Han Liu
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, GERM improves fine-tuning performance by 37.98% and quantization by 64.34% over the baseline model. It also reduces average kurtosis by 92.14% and maximum infinity norm by 82.77%. Compared to leading methods, GERM consistently delivers superior performance, offering a practical solution for genomic modeling in resource-constrained settings. Code is available at https://github.com/MAGICS-LAB/GERM. |
| Researcher Affiliation | Academia | 1Northwestern University 2Tianjin University 3Vernon Hills High School. Correspondence to: Haozheng Luo <EMAIL>, Chenghao Qiu <EMAIL>, Maojiang Su <EMAIL>, Zhihan Zhou <EMAIL>, Zoe Mehta <EMAIL>, Guo Ye <EMAIL>, Jerry Yao-Chieh Hu <EMAIL>, Han Liu <EMAIL>. |
| Pseudocode | No | The paper includes a theoretical analysis in Appendix A with definitions and theorems, but does not present any structured pseudocode or algorithm blocks in a clear, formatted manner. |
| Open Source Code | Yes | Code is available at https://github.com/MAGICS-LAB/GERM. |
| Open Datasets | Yes | We utilize 27 datasets spanning 7 tasks and 4 species, as outlined in (Zhou et al., 2024). ... Additionally, we analyze related GenBench datasets (Liu et al., 2025) and find that, uniquely, GenBench includes some regression downstream tasks, providing a broader evaluation spectrum. |
| Dataset Splits | No | The paper states: 'We utilize 27 datasets spanning 7 tasks and 4 species, as outlined in (Zhou et al., 2024).' and mentions 'We evaluate the models on the test datasets...' but does not explicitly provide specific training/test/validation split percentages, sample counts, or detailed splitting methodology within this paper. It defers to the cited paper for dataset details. |
| Hardware Specification | Yes | We perform all experiments using 2 NVIDIA A100 GPUs with 80GB of memory and a 24-core Intel(R) Xeon(R) Gold 6338 CPU operating at 2.00GHz. ... Our model fine-tunes DNABERT in just 5 minutes on a single NVIDIA GeForce RTX 2080 Ti GPU. ... To demonstrate GERM's capability in CPU-only computing environments, we perform performance tests on a 64-core Intel(R) Xeon(R) Gold 6338 CPU @ 2.00GHz with 50GB RAM. |
| Software Dependencies | No | Our code is developed in PyTorch and utilizes the Hugging Face Transformers library for experimental execution. The paper mentions these software components but does not provide specific version numbers for them. |
| Experiment Setup | Yes | We use AdamW (Loshchilov & Hutter, 2019) as the optimizer. Most of the other hyperparameters remain the same across all models and datasets, including a batch size of 32, a warmup step of 50, and a weight decay of 0.01. A learning rate of 3e-5 is used for all models during fine-tuning. For low-rank adaptation, we use a learning rate of 1e-4, with a LoRA rank of 8 and LoRA alpha set to 16. For each task, we use different training steps as shown in Table 5. During pre-training, the model is trained for 200,000 steps with a batch size of 1024 and a maximum sequence length of 512, using the AdamW optimizer with β1 = 0.9, β2 = 0.98, and ϵ = 1e-6. |
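The hyperparameters quoted in the Experiment Setup row can be collected into explicit configuration objects for a reproduction attempt. Below is a minimal, stdlib-only sketch; the class and field names are our own (the GERM repository may organize these differently), but the values match the paper's reported setup:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneConfig:
    """Fine-tuning hyperparameters reported in the paper (shared across models/datasets)."""
    optimizer: str = "AdamW"
    batch_size: int = 32
    warmup_steps: int = 50
    weight_decay: float = 0.01
    learning_rate: float = 3e-5       # full fine-tuning
    lora_learning_rate: float = 1e-4  # low-rank adaptation
    lora_rank: int = 8
    lora_alpha: int = 16


@dataclass(frozen=True)
class PretrainConfig:
    """Pre-training hyperparameters reported in the paper."""
    steps: int = 200_000
    batch_size: int = 1024
    max_seq_length: int = 512
    adam_beta1: float = 0.9
    adam_beta2: float = 0.98
    adam_epsilon: float = 1e-6


ft = FineTuneConfig()
pt = PretrainConfig()
print(ft.learning_rate, ft.lora_rank, pt.steps)
```

Per-task training-step counts are deferred to the paper's Table 5 and are intentionally not encoded here.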