Language Models for Controllable DNA Sequence Design
Authors: Xingyu Su, Xiner Li, Yuchao Lin, Ziqian Xie, Degui Zhi, Shuiwang Ji
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate ATGC-Gen on representative tasks including promoter and enhancer sequence design, and further introduce a new dataset based on ChIP-Seq experiments for modeling protein binding specificity. Our experiments demonstrate that ATGC-Gen can generate fluent, diverse, and biologically relevant sequences aligned with the desired properties. |
| Researcher Affiliation | Academia | Xingyu Su EMAIL Texas A&M University Xiner Li EMAIL Texas A&M University Yuchao Lin EMAIL Texas A&M University Ziqian Xie EMAIL University of Texas Health Science Center at Houston Degui Zhi EMAIL University of Texas Health Science Center at Houston Shuiwang Ji EMAIL Texas A&M University |
| Pseudocode | No | The paper describes the methodology in narrative text and provides figures (e.g., Figure 1 for framework overview) but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | The source code is released at https://github.com/divelab/AIRS/blob/main/OpenBio/ATGC_Gen. |
| Open Datasets | Yes | Our proposed dataset is obtained from a comprehensive collection of ChIP-Seq experiments generated by the ENCODE project (Consortium et al., 2012)... This dataset contains 100,000 promoter sequences from the GRCh38 (HG38) human reference genome. We use the same DNA sequences as in previous works (Avdeyev et al., 2023; Stark et al., 2024)... This task involves two distinct datasets: one from fly brain (Janssens et al., 2022) and the other from human melanoma cells (Atak et al., 2021). |
| Dataset Splits | Yes | After completing the preprocessing steps, we partition the dataset by chromosome: chromosomes 20 and 21 are used for validation, while chromosomes 22 and X are held out for testing. The remaining chromosomes are used for training. |
| Hardware Specification | Yes | All training and inference are conducted on NVIDIA A100-SXM-80GB GPUs. |
| Software Dependencies | No | The paper mentions software components like "bert-base configuration", "GPT-style decoder", and "AdamW optimizer" but does not provide specific version numbers for any key software libraries or programming languages. |
| Experiment Setup | Yes | We use the AdamW optimizer with a learning rate of 1×10⁻⁴ and a linear warmup over the first 10% of training epochs. Model selection is based on performance on the validation set. The batch size is adjusted to fit within GPU memory constraints. |
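
The Experiment Setup row describes AdamW with a base learning rate of 1×10⁻⁴ and a linear warmup over the first 10% of training. A minimal sketch of that schedule is below; the function name, the step-based (rather than epoch-based) formulation, and the constant learning rate after warmup are assumptions for illustration, since the paper does not specify any post-warmup decay.

```python
def lr_at_step(step: int, total_steps: int, base_lr: float = 1e-4,
               warmup_frac: float = 0.10) -> float:
    """Linearly ramp the learning rate from ~0 to base_lr over the
    first warmup_frac of training, then hold it constant (assumed)."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr

# Example: a 1000-step run warms up over the first 100 steps.
schedule = [lr_at_step(s, 1000) for s in range(1000)]
print(schedule[0], schedule[99], schedule[500])
```

In a framework such as PyTorch, this multiplier could be supplied to the optimizer via a `LambdaLR`-style scheduler wrapping AdamW.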