Language Models for Controllable DNA Sequence Design
Authors: Xingyu Su, Xiner Li, Yuchao Lin, Ziqian Xie, Degui Zhi, Shuiwang Ji
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate ATGC-Gen on representative tasks including promoter and enhancer sequence design, and further introduce a new dataset based on ChIP-Seq experiments for modeling protein binding specificity. Our experiments demonstrate that ATGC-Gen can generate fluent, diverse, and biologically relevant sequences aligned with the desired properties. |
| Researcher Affiliation | Academia | Xingyu Su EMAIL Texas A&M University Xiner Li EMAIL Texas A&M University Yuchao Lin EMAIL Texas A&M University Ziqian Xie EMAIL University of Texas Health Science Center at Houston Degui Zhi EMAIL University of Texas Health Science Center at Houston Shuiwang Ji EMAIL Texas A&M University |
| Pseudocode | No | The paper describes the methodology in narrative text and provides figures (e.g., Figure 1 for framework overview) but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | The source code is released at https://github.com/divelab/AIRS/blob/main/OpenBio/ATGC_Gen. |
| Open Datasets | Yes | Our proposed dataset is obtained from a comprehensive collection of ChIP-Seq experiments generated by the ENCODE project (Consortium et al., 2012)... This dataset contains 100,000 promoter sequences from the GRCh38 (HG38) human reference genome. We use the same DNA sequences as in previous works (Avdeyev et al., 2023; Stark et al., 2024)... This task involves two distinct datasets: one from fly brain (Janssens et al., 2022) and the other from human melanoma cells (Atak et al., 2021). |
| Dataset Splits | Yes | After completing the preprocessing steps, we partition the dataset by chromosome: chromosomes 20 and 21 are used for validation, while chromosomes 22 and X are held out for testing. The remaining chromosomes are used for training. |
| Hardware Specification | Yes | All training and inference are conducted on NVIDIA A100-SXM-80GB GPUs. |
| Software Dependencies | No | The paper mentions software components like "bert-base configuration", "GPT-style decoder", and "AdamW optimizer" but does not provide specific version numbers for any key software libraries or programming languages. |
| Experiment Setup | Yes | We use the AdamW optimizer with a learning rate of 1×10⁻⁴ and a linear warmup over the first 10% of training epochs. Model selection is based on performance on the validation set. The batch size is adjusted to fit within GPU memory constraints. |
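
The Experiment Setup row describes AdamW with a base learning rate of 1×10⁻⁴ and a linear warmup over the first 10% of training. A minimal sketch of that schedule is below; the function name, the step-based (rather than epoch-based) formulation, and the constant learning rate after warmup are assumptions for illustration, since the paper does not specify any post-warmup decay.

```python
def lr_at_step(step: int, total_steps: int, base_lr: float = 1e-4,
               warmup_frac: float = 0.10) -> float:
    """Linearly ramp the learning rate from ~0 to base_lr over the
    first warmup_frac of training, then hold it constant (assumed)."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr

# Example: a 1000-step run warms up over the first 100 steps.
schedule = [lr_at_step(s, 1000) for s in range(1000)]
print(schedule[0], schedule[99], schedule[500])
```

In a framework such as PyTorch, this multiplier could be supplied to the optimizer via a `LambdaLR`-style scheduler wrapping AdamW.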