HELM: Hierarchical Encoding for mRNA Language Modeling
Authors: Mehdi Yazdani-Jahromi, Mangal Prakash, Tommaso Mansi, Artem Moskalev, Rui Liao
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate HELM on diverse mRNA datasets and tasks, demonstrating that HELM outperforms standard language model pre-training as well as existing foundation model baselines on seven diverse downstream property prediction tasks and an antibody region annotation task on average by around 8%. |
| Researcher Affiliation | Collaboration | Mehdi Yazdani-Jahromi, University of Central Florida (EMAIL); Mangal Prakash, Johnson & Johnson Innovative Medicine (EMAIL); Tommaso Mansi, Johnson & Johnson Innovative Medicine (EMAIL); Artem Moskalev, Johnson & Johnson Innovative Medicine (EMAIL); Rui Liao, Johnson & Johnson Innovative Medicine (EMAIL) |
| Pseudocode | No | The paper describes methods and mathematical formulations but does not present a clearly labeled pseudocode block or algorithm. |
| Open Source Code | No | Code, datasets, and model weights will be made publicly available to the research community, allowing others to reproduce, verify, and extend our work. |
| Open Datasets | Yes | For this reason, we curated the OAS database (Olsen et al., 2022) which contains antibody mRNA data from over 80 different studies with around 2 billion unpaired and 1.5 million paired sequences from various species. Although prior studies have curated this database on protein level (Ruffolo et al., 2021; Shuai et al., 2023; Kenlay et al., 2024) in the context of antibody-protein language modeling, a high-quality curated version of corresponding mRNA data does not exist. |
| Dataset Splits | Yes | For iCodon, Tc-Riboswitches, mRFP and COVID-19 Vaccine datasets, we use predefined splits from prior publications to ensure fair comparison. For other datasets, we apply clustering-based train/validation/test splitting (LinClust (Steinegger & Söding, 2018) similarity threshold 0.9) to prevent data leakage. We use a train/validation/test split ratio of 70:15:15. |
| Hardware Specification | Yes | All models are trained using 8 NVIDIA A100 GPUs, each with 80GB of GPU memory. |
| Software Dependencies | No | The paper mentions software like 'GPT-2', 'Mamba', 'Hyena' (model architectures), and 'AdamW optimizer', but does not specify version numbers for any key software components or libraries used in their implementation. |
| Experiment Setup | Yes | All models use 50M parameters, balancing performance and efficiency. We found that models of this scale can outperform larger existing models while maintaining reasonable run-times (see Appendix A.8). Detailed pre-training information is available in Appendix A.2. Table 3: Hyperparameters for trained LM models (GPT-2 / Mamba / Hyena): Number of layers 10 / 40 / 7; Hidden size 640 / 256 / 768; Intermediate size 2560 / 1024 / 3072; Batch size 1024 for all; Learning rate (XE) 1e-3 / 1e-3 / 1e-4; Learning rate (HXE) 1e-4 for all; Minimum learning rate 1e-5 / 1e-5 / 1e-6; Weight decay 0.1 for all; Number of epochs 40 for all; Vocabulary size 70 for all. |
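The clustering-based 70:15:15 splitting described in the Dataset Splits row can be sketched as follows. This is a minimal illustration, not the authors' implementation: LinClust is an external tool, so cluster assignments are assumed to be given as input (the `cluster_ids` list and `cluster_split` function are hypothetical names). The key idea is that whole clusters, not individual sequences, are assigned to a split, so near-duplicate sequences (similarity above the 0.9 threshold) never leak across the train/validation/test boundary.

```python
import random

def cluster_split(cluster_ids, ratios=(0.70, 0.15, 0.15), seed=0):
    """Assign each sequence to train/val/test by splitting at the
    cluster level, so all members of a cluster land in one split.

    cluster_ids: per-sequence cluster label (e.g. from LinClust).
    ratios: approximate train/val/test fractions of clusters.
    Returns a list of split labels, one per input sequence.
    """
    clusters = sorted(set(cluster_ids))
    rng = random.Random(seed)
    rng.shuffle(clusters)

    n_train = int(ratios[0] * len(clusters))
    n_val = int(ratios[1] * len(clusters))
    train = set(clusters[:n_train])
    val = set(clusters[n_train:n_train + n_val])
    # Remaining clusters form the test set.

    labels = []
    for cid in cluster_ids:
        if cid in train:
            labels.append("train")
        elif cid in val:
            labels.append("val")
        else:
            labels.append("test")
    return labels
```

Because splitting happens over clusters rather than sequences, the realized per-sequence fractions only approximate 70:15:15 when cluster sizes are uneven, which is the usual trade-off accepted to prevent data leakage.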