L2G: Repurposing Language Models for Genomics Tasks
Authors: Wenduo Cheng, Junhong Shen, Mikhail Khodak, Jian Ma, Ameet Talwalkar
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the empirical effectiveness and efficiency of L2G through extensive experiments on two genomics benchmarks and a challenging regression task for enhancer activity prediction. Beyond presenting results on predictive accuracy, we assess L2G's ability to learn relevant TF motifs and evaluate the efficacy of cross-modal fine-tuning through embedding analyses and ablation studies. |
| Researcher Affiliation | Academia | Wenduo Cheng EMAIL Ray and Stephanie Lane Computational Biology Department, Carnegie Mellon University; Junhong Shen EMAIL Machine Learning Department, Carnegie Mellon University; Mikhail Khodak EMAIL Princeton Language & Intelligence, Princeton AI Lab; Jian Ma EMAIL Ray and Stephanie Lane Computational Biology Department, Carnegie Mellon University; Ameet Talwalkar EMAIL Machine Learning Department, Carnegie Mellon University |
| Pseudocode | Yes | Algorithm 1: Pseudocode for the L2G workflow. Input: genomic dataset G, set of embedder backbone architectures B, language model L, alignment loss weight α, task-specific loss weight β. For each architecture b ∈ B: initialize b; val_score_b ← train b for one epoch on G. best_b ← argmax_{b∈B} val_score_b (select the embedder backbone with the best validation score). (k, d) ← DASH(best_b) (optimize the kernels and dilations). h_text ← inference of L on the source text dataset (generate text embeddings). Initialize best_b with (k, d). For epoch in embedder_epochs: pred_1, h_DNA ← best_b(G); loss_1 ← L_MMD(h_text, h_DNA); loss_2 ← L_task(pred_1, labels); update the embedder to minimize α·loss_1 + β·loss_2. model ← embedder + transformer blocks from L + linear predictor. pred_2 ← train model on G. Return pred_2. |
| Open Source Code | Yes | A.1 Code Availability The source code of L2G can be accessed at: https://github.com/wenduocheng/L2G. |
| Open Datasets | Yes | A.2 Data Availability In this work, we utilized several public datasets. The Genomic Benchmark is available at: https://github.com/ML-Bioinfo-CEITEC/genomic_benchmarks. The Nucleotide Transformer benchmarks can be downloaded from Hugging Face at: https://huggingface.co/datasets/InstaDeepAI/nucleotide_transformer_downstream_tasks. The DART-Eval benchmark is available at: https://github.com/kundajelab/DART-Eval. The DeepSTARR dataset is available on Zenodo at: https://doi.org/10.5281/zenodo.5502060. |
| Dataset Splits | Yes | The Genomic Benchmarks dataset (Grešová et al., 2023) includes eight classification tasks: seven binary and one three-way classification task... The Nucleotide Transformer Benchmarks dataset... evaluates genomic FMs on 18 classification tasks... DART-Eval is a recent benchmark that curates biologically significant tasks... Developmental and Housekeeping Enhancer Activity Predictions is a two-class regression task... The dataset is sourced from the DeepSTARR project (de Almeida et al., 2022). |
| Hardware Specification | Yes | All our experiments can be performed on a single A6000 GPU in a matter of hours by leveraging existing open-source language models, compared to days of training needed to develop genomic FMs from scratch. |
| Software Dependencies | No | The paper mentions several software tools and libraries, such as PyTorch, Keras, TensorFlow, RoBERTa-base, DeepLiftShap, and TF-MoDISco-lite, but does not provide specific version numbers for these software components. For example, it does not state 'PyTorch 1.9' or 'TensorFlow 2.x'. |
| Experiment Setup | Yes | Table 15 provides the hyperparameter settings used for training L2G: distribution alignment metric: MMD; transformer backbone: RoBERTa-base; target sequence length: 512; training epochs: 25; embedder pre-training epochs: 80-100; warm-up epochs: 5; decay epochs: 25; α (weight for alignment loss): 1; β (weight for task loss): 1; dropout: 0.05; gradient clipping: [-1, 1]; batch size: 64-128; embedder pre-training optimizer: SGD; embedder pre-training learning rate: searched by DASH; fine-tuning optimizer: Adam; fine-tuning optimizer betas: [0.9, 0.98]; fine-tuning learning rate: 1e-5; weight decay: 1e-5; scheduler: step decay. |
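The embedder pre-training objective in Algorithm 1 combines an MMD alignment loss between text and DNA embeddings with a task loss, weighted α·L_MMD + β·L_task with α = β = 1 (Table 15). The sketch below shows a minimal biased MMD estimate with an RBF kernel on toy embeddings; the function names, kernel bandwidth, and placeholder task-loss value are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def rbf_kernel(a, b, gamma=1.0):
    # Pairwise RBF kernel matrix between rows of a and rows of b.
    sq_dists = (np.sum(a**2, axis=1)[:, None]
                + np.sum(b**2, axis=1)[None, :]
                - 2.0 * a @ b.T)
    return np.exp(-gamma * sq_dists)

def mmd2(x, y, gamma=1.0):
    # Biased estimate of squared MMD between samples x and y:
    # E[k(x,x')] + E[k(y,y')] - 2 E[k(x,y)].
    return (rbf_kernel(x, x, gamma).mean()
            + rbf_kernel(y, y, gamma).mean()
            - 2.0 * rbf_kernel(x, y, gamma).mean())

# Toy stand-ins for h_text and h_DNA (128 samples, 16-dim embeddings).
rng = np.random.default_rng(0)
dim = 16
h_text = rng.normal(0.0, 1.0, size=(128, dim))
h_dna_far = rng.normal(3.0, 1.0, size=(128, dim))   # poorly aligned
h_dna_near = rng.normal(0.0, 1.0, size=(128, dim))  # well aligned

gamma = 1.0 / dim  # simple bandwidth heuristic for this toy setting
loss_far = mmd2(h_text, h_dna_far, gamma)
loss_near = mmd2(h_text, h_dna_near, gamma)

# Combined objective: α·L_MMD + β·L_task with α = β = 1 (Table 15).
task_loss = 0.37  # placeholder for L_task (e.g. cross-entropy); value is illustrative
total_loss = 1.0 * loss_far + 1.0 * task_loss
```

Embeddings drawn from mismatched distributions (`h_dna_far`) incur a larger alignment loss than well-aligned ones (`h_dna_near`), which is the pressure the MMD term exerts during embedder pre-training.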