Revisiting Convolution Architecture in the Realm of DNA Foundation Models

Authors: Yu Bo, Weian Mao, Daniel Shao, Weiqiang Bai, Peng Ye, Xinzhu Ma, Junbo Zhao, Hao Chen, Chunhua Shen

ICLR 2025

Reproducibility assessment. Each entry lists the variable, the result, and the LLM's response:
Research Type: Experimental. Through extensive empirical experiments, we demonstrate that ConvNova significantly outperforms recent methods on more than half of the tasks across several foundation model benchmarks. For example, in histone-related tasks, ConvNova exceeds the second-best method by an average of 5.8%, while generally utilizing fewer parameters and allowing faster computation. In addition, the experiments reveal findings that may be related to biological characteristics.
Researcher Affiliation: Collaboration. (1) Zhejiang University; (2) MIT, USA; (3) Yale University, USA; (4) Shanghai AI Lab; (5) Ant Group; (6) Zhejiang University of Technology.
Pseudocode: No. The paper describes the Gated Convolution Block and its variants using mathematical equations (Eq. 1-4) in Sections 3.2 and A.3.1, but it includes no explicitly labeled 'Pseudocode' or 'Algorithm' blocks and presents no structured steps in a code-like format.
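Although the paper supplies only equations, the generic gated-convolution pattern they build on is easy to sketch. The NumPy sketch below shows the common form (element-wise product of a tanh feature branch and a sigmoid gate branch); the paper's exact GCB formulation (Eq. 1-4) may differ, so treat this as an illustration of the pattern, not the authors' block.

```python
import numpy as np

def conv1d(x, w):
    """'Same'-padded 1D convolution: x is (L, C_in), w is (K, C_in, C_out)."""
    K, C_in, C_out = w.shape
    pad = K // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    out = np.zeros((x.shape[0], C_out))
    for i in range(x.shape[0]):
        # contract the K x C_in window against the kernel
        out[i] = np.einsum("kc,kco->o", xp[i:i + K], w)
    return out

def gated_conv_block(x, w_feat, w_gate):
    """Generic gated convolution: tanh(conv(x)) * sigmoid(conv(x)).

    This mirrors the widely used gating pattern; it is not claimed to
    match the paper's GCB equations term for term.
    """
    feat = np.tanh(conv1d(x, w_feat))
    gate = 1.0 / (1.0 + np.exp(-conv1d(x, w_gate)))  # sigmoid
    return feat * gate

rng = np.random.default_rng(0)
x = rng.standard_normal((16, 8))            # sequence length 16, 8 channels
w_f = rng.standard_normal((3, 8, 8)) * 0.1  # kernel size 3
w_g = rng.standard_normal((3, 8, 8)) * 0.1
y = gated_conv_block(x, w_f, w_g)           # y.shape == (16, 8)
```

Because tanh is bounded in (-1, 1) and the sigmoid gate in (0, 1), every output of the block lies strictly inside (-1, 1), which keeps activations well-scaled when blocks are stacked.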
Open Source Code: Yes. Code is available at: https://github.com/aim-uofa/ConvNova
Open Datasets: Yes. We use the bidirectional masked language model (MLM) pretraining method... The pretraining data is HG38, the same as HyenaDNA (Nguyen et al., 2024). We start the evaluation with the recently proposed Nucleotide Transformer Benchmark (Dalla-Torre et al., 2023). Next, we conduct a comprehensive evaluation using eight datasets introduced by Genomic Benchmarks (Grešová et al., 2023). The task employs the GENCODE dataset (Harrow et al., 2012).
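The MLM pretraining objective referenced above can be sketched for DNA tokens in a few lines. The 15% masking rate and the 80/10/10 replacement scheme below are standard BERT-style defaults, assumed here for illustration; the quoted excerpts do not state the paper's exact masking recipe.

```python
import random

NUCLEOTIDES = ["A", "C", "G", "T"]
MASK = "[MASK]"

def mask_sequence(seq, mask_rate=0.15, seed=0):
    """BERT-style masking for a DNA sequence (assumed defaults).

    Returns (tokens, labels): labels hold the original base at masked
    positions and None elsewhere, so the loss is computed only where
    the model must reconstruct the input.
    """
    rng = random.Random(seed)
    tokens, labels = [], []
    for base in seq:
        if rng.random() < mask_rate:
            labels.append(base)
            r = rng.random()
            if r < 0.8:
                tokens.append(MASK)                    # 80%: mask token
            elif r < 0.9:
                tokens.append(rng.choice(NUCLEOTIDES))  # 10%: random base
            else:
                tokens.append(base)                     # 10%: keep original
        else:
            tokens.append(base)
            labels.append(None)
    return tokens, labels

tokens, labels = mask_sequence("ACGTACGTACGTACGT")
```

The keep/randomize cases prevent the model from relying on the literal `[MASK]` token being present at inference time.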
Dataset Splits: Yes. A.2.1 Nucleotide Transformer Benchmark, Models and Setup: We follow the HyenaDNA setup (Nguyen et al., 2024), splitting the dataset into 90/10 training and test sets... A.2.2 Chromatin Profile Prediction, Experiment Configuration: The train/test split follows the DeepSEA paper methodology. A.2.4 Genomic Benchmark, Models and Setup: We follow the train-valid split (90/10) as provided by HyenaDNA (Nguyen et al., 2024).
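A 90/10 split like the one described in A.2.1 is a one-function affair. The sketch below is a minimal shuffled split; the seed is an assumption for reproducibility of the sketch itself, since the paper inherits its actual split from HyenaDNA rather than defining one.

```python
import random

def train_test_split(items, test_frac=0.1, seed=42):
    """Shuffle and partition items into (1 - test_frac)/test_frac sets.

    Mirrors the 90/10 split described in A.2.1; the seed value here is
    an illustrative assumption, not taken from the paper.
    """
    items = list(items)
    random.Random(seed).shuffle(items)
    n_test = int(len(items) * test_frac)
    return items[n_test:], items[:n_test]

train, test = train_test_split(range(1000))  # 900 train, 100 test
```

Shuffling before slicing matters: genomic datasets are often ordered by chromosome or locus, and an unshuffled tail split would not be representative.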
Hardware Specification: Yes. Figure 1: All models are around 7M parameters, tested on an A100 80GB with batch size 1. A.1 Pretraining: The time required for pretraining different model sizes with varying sequence lengths on 4 RTX 4090 GPUs is reported.
Software Dependencies: No. The paper mentions using 'AdamW' as the optimizer and 'Hugging Face' for pre-trained weights, but it does not provide version numbers for any software libraries, programming languages, or other dependencies required to replicate the experiments.
Experiment Setup: Yes. A.1 Pretraining: We set the global batch size to 512 and trained for 400 epochs with a learning rate of 1e-3... Table 7, pretraining hyperparameters: learning rate 1e-3; batch size 256; weight decay 0.1; dropout 0.0; optimizer AdamW; optimizer momentum β1 = 0.9, β2 = 0.999; learning rate scheduler cosine decay; training epochs 200 (the prose and Table 7 report different batch sizes and epoch counts, presumably for different configurations). Table 9 gives hyperparameters for the ConvNova model on all downstream tasks, including sequence length, dilation rate, hidden dimension, number of GCBs, and model size.
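The Table 7 schedule (cosine decay from a peak learning rate of 1e-3 over 200 epochs) can be reproduced directly from its definition. The zero floor below is an assumption, as Table 7 lists only the scheduler type and peak rate; no warmup is included for the same reason.

```python
import math

PEAK_LR = 1e-3   # Table 7: Learning Rate
EPOCHS = 200     # Table 7: Training Epochs
MIN_LR = 0.0     # assumed floor; Table 7 does not state one

def cosine_lr(epoch, peak=PEAK_LR, total=EPOCHS, floor=MIN_LR):
    """Cosine decay from `peak` to `floor` over `total` epochs.

    At each epoch the rate would be fed to AdamW configured as in
    Table 7 (betas=(0.9, 0.999), weight_decay=0.1) in the training
    framework of choice.
    """
    progress = epoch / total
    return floor + 0.5 * (peak - floor) * (1.0 + math.cos(math.pi * progress))

schedule = [cosine_lr(e) for e in range(EPOCHS + 1)]
# schedule[0] == 1e-3, schedule[100] ~= 5e-4, schedule[200] == 0.0
```

Cosine decay spends most of the budget near the peak rate and anneals smoothly to the floor, which is why it pairs well with a fixed epoch count like the 200 epochs in Table 7.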