Revisiting Convolution Architecture in the Realm of DNA Foundation Models

Authors: Yu Bo, Weian Mao, Daniel Shao, Weiqiang Bai, Peng Ye, Xinzhu Ma, Junbo Zhao, Hao Chen, Chunhua Shen

ICLR 2025

Reproducibility assessment. Each entry lists the variable, the result, and the LLM's response:
Research Type: Experimental. Through extensive empirical experiments, we demonstrate that ConvNova significantly outperforms recent methods on more than half of the tasks across several foundation model benchmarks. For example, in histone-related tasks, ConvNova exceeds the second-best method by an average of 5.8%, while generally utilizing fewer parameters and allowing faster computation. In addition, the experiments reveal findings that may be related to biological characteristics.
Researcher Affiliation: Collaboration. (1) Zhejiang University; (2) MIT, USA; (3) Yale University, USA; (4) Shanghai AI Lab; (5) Ant Group; (6) Zhejiang University of Technology.
Pseudocode: No. The paper describes the Gated Convolution Block and its variants using mathematical equations (Eq. 1-4) in Sections 3.2 and A.3.1, but it includes no explicitly labeled 'Pseudocode' or 'Algorithm' blocks and presents no structured steps in a code-like format.
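Although the paper supplies only equations, the generic gated-convolution pattern they build on is easy to sketch. The NumPy sketch below shows the common form (element-wise product of a tanh feature branch and a sigmoid gate branch); the paper's exact GCB formulation (Eq. 1-4) may differ, so treat this as an illustration of the pattern, not the authors' block.

```python
import numpy as np

def conv1d(x, w):
    """'Same'-padded 1D convolution: x is (L, C_in), w is (K, C_in, C_out)."""
    K, C_in, C_out = w.shape
    pad = K // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    out = np.zeros((x.shape[0], C_out))
    for i in range(x.shape[0]):
        # contract the K x C_in window against the kernel
        out[i] = np.einsum("kc,kco->o", xp[i:i + K], w)
    return out

def gated_conv_block(x, w_feat, w_gate):
    """Generic gated convolution: tanh(conv(x)) * sigmoid(conv(x)).

    This mirrors the widely used gating pattern; it is not claimed to
    match the paper's GCB equations term for term.
    """
    feat = np.tanh(conv1d(x, w_feat))
    gate = 1.0 / (1.0 + np.exp(-conv1d(x, w_gate)))  # sigmoid
    return feat * gate

rng = np.random.default_rng(0)
x = rng.standard_normal((16, 8))            # sequence length 16, 8 channels
w_f = rng.standard_normal((3, 8, 8)) * 0.1  # kernel size 3
w_g = rng.standard_normal((3, 8, 8)) * 0.1
y = gated_conv_block(x, w_f, w_g)           # y.shape == (16, 8)
```

Because tanh is bounded in (-1, 1) and the sigmoid gate in (0, 1), every output of the block lies strictly inside (-1, 1), which keeps activations well-scaled when blocks are stacked.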
Open Source Code: Yes. Code is available at: https://github.com/aim-uofa/ConvNova
Open Datasets: Yes. We use the bidirectional masked language model (MLM) pretraining method... The pretraining data is HG38, the same as HyenaDNA (Nguyen et al., 2024). We start the evaluation with the recently proposed Nucleotide Transformer Benchmark (Dalla-Torre et al., 2023). Next, we conduct a comprehensive evaluation using eight datasets introduced by Genomic Benchmarks (Grešová et al., 2023). The task employs the GENCODE dataset (Harrow et al., 2012).
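The MLM pretraining objective referenced above can be sketched for DNA tokens in a few lines. The 15% masking rate and the 80/10/10 replacement scheme below are standard BERT-style defaults, assumed here for illustration; the quoted excerpts do not state the paper's exact masking recipe.

```python
import random

NUCLEOTIDES = ["A", "C", "G", "T"]
MASK = "[MASK]"

def mask_sequence(seq, mask_rate=0.15, seed=0):
    """BERT-style masking for a DNA sequence (assumed defaults).

    Returns (tokens, labels): labels hold the original base at masked
    positions and None elsewhere, so the loss is computed only where
    the model must reconstruct the input.
    """
    rng = random.Random(seed)
    tokens, labels = [], []
    for base in seq:
        if rng.random() < mask_rate:
            labels.append(base)
            r = rng.random()
            if r < 0.8:
                tokens.append(MASK)                    # 80%: mask token
            elif r < 0.9:
                tokens.append(rng.choice(NUCLEOTIDES))  # 10%: random base
            else:
                tokens.append(base)                     # 10%: keep original
        else:
            tokens.append(base)
            labels.append(None)
    return tokens, labels

tokens, labels = mask_sequence("ACGTACGTACGTACGT")
```

The keep/randomize cases prevent the model from relying on the literal `[MASK]` token being present at inference time.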
Dataset Splits: Yes. A.2.1 Nucleotide Transformer Benchmark, Models and Setup: We follow the HyenaDNA setup (Nguyen et al., 2024), splitting the dataset into 90/10 training and test sets... A.2.2 Chromatin Profile Prediction, Experiment Configuration: The train/test split follows the DeepSEA paper methodology. A.2.4 Genomic Benchmark, Models and Setup: We follow the train-valid split (90/10) as provided by HyenaDNA (Nguyen et al., 2024).
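A 90/10 split like the one described in A.2.1 is a one-function affair. The sketch below is a minimal shuffled split; the seed is an assumption for reproducibility of the sketch itself, since the paper inherits its actual split from HyenaDNA rather than defining one.

```python
import random

def train_test_split(items, test_frac=0.1, seed=42):
    """Shuffle and partition items into (1 - test_frac)/test_frac sets.

    Mirrors the 90/10 split described in A.2.1; the seed value here is
    an illustrative assumption, not taken from the paper.
    """
    items = list(items)
    random.Random(seed).shuffle(items)
    n_test = int(len(items) * test_frac)
    return items[n_test:], items[:n_test]

train, test = train_test_split(range(1000))  # 900 train, 100 test
```

Shuffling before slicing matters: genomic datasets are often ordered by chromosome or locus, and an unshuffled tail split would not be representative.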
Hardware Specification: Yes. Figure 1: All models are around 7M parameters, tested on an A100 80GB with batch size 1. A.1 Pretraining: The time required for pretraining different model sizes with varying sequence lengths on 4 RTX 4090 GPUs is reported.
Software Dependencies: No. The paper mentions using 'AdamW' as the optimizer and 'Hugging Face' for pre-trained weights, but it does not provide version numbers for any software libraries, programming languages, or other dependencies required to replicate the experiments.
Experiment Setup: Yes. A.1 Pretraining: We set the global batch size to 512 and trained for 400 epochs with a learning rate of 1e-3... Table 7, pretraining hyperparameters: learning rate 1e-3; batch size 256; weight decay 0.1; dropout 0.0; optimizer AdamW; optimizer momentum β1 = 0.9, β2 = 0.999; learning rate scheduler cosine decay; training epochs 200 (the prose and Table 7 report different batch sizes and epoch counts, presumably for different configurations). Table 9 gives hyperparameters for the ConvNova model on all downstream tasks, including sequence length, dilation rate, hidden dimension, number of GCBs, and model size.
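The Table 7 schedule (cosine decay from a peak learning rate of 1e-3 over 200 epochs) can be reproduced directly from its definition. The zero floor below is an assumption, as Table 7 lists only the scheduler type and peak rate; no warmup is included for the same reason.

```python
import math

PEAK_LR = 1e-3   # Table 7: Learning Rate
EPOCHS = 200     # Table 7: Training Epochs
MIN_LR = 0.0     # assumed floor; Table 7 does not state one

def cosine_lr(epoch, peak=PEAK_LR, total=EPOCHS, floor=MIN_LR):
    """Cosine decay from `peak` to `floor` over `total` epochs.

    At each epoch the rate would be fed to AdamW configured as in
    Table 7 (betas=(0.9, 0.999), weight_decay=0.1) in the training
    framework of choice.
    """
    progress = epoch / total
    return floor + 0.5 * (peak - floor) * (1.0 + math.cos(math.pi * progress))

schedule = [cosine_lr(e) for e in range(EPOCHS + 1)]
# schedule[0] == 1e-3, schedule[100] ~= 5e-4, schedule[200] == 0.0
```

Cosine decay spends most of the budget near the peak rate and anneals smoothly to the floor, which is why it pairs well with a fixed epoch count like the 200 epochs in Table 7.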