HELM: Hierarchical Encoding for mRNA Language Modeling

Authors: Mehdi Yazdani-Jahromi, Mangal Prakash, Tommaso Mansi, Artem Moskalev, Rui Liao

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental We evaluate HELM on diverse mRNA datasets and tasks, demonstrating that HELM outperforms standard language model pre-training as well as existing foundation model baselines on seven diverse downstream property prediction tasks and an antibody region annotation task on average by around 8%.
Researcher Affiliation Collaboration Mehdi Yazdani-Jahromi (University of Central Florida); Mangal Prakash, Tommaso Mansi, Artem Moskalev, Rui Liao (Johnson & Johnson Innovative Medicine)
Pseudocode No The paper describes methods and mathematical formulations but does not present a clearly labeled pseudocode block or algorithm.
Open Source Code No Code, datasets, and model weights will be made publicly available to the research community, allowing others to reproduce, verify, and extend our work.
Open Datasets Yes For this reason, we curated the OAS database (Olsen et al., 2022), which contains antibody mRNA data from over 80 different studies with around 2 billion unpaired and 1.5 million paired sequences from various species. Although prior studies have curated this database on the protein level (Ruffolo et al., 2021; Shuai et al., 2023; Kenlay et al., 2024) in the context of antibody-protein language modeling, a high-quality curated version of the corresponding mRNA data does not exist.
Dataset Splits Yes For iCodon, Tc-Riboswitches, mRFP and COVID-19 Vaccine datasets, we use predefined splits from prior publications to ensure fair comparison. For other datasets, we apply clustering-based train/validation/test splitting (LinClust (Steinegger & Söding, 2018), similarity threshold 0.9) to prevent data leakage. We use a train/validation/test split ratio of 70:15:15.
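The leakage-safe split described above assigns whole similarity clusters to a single partition, so near-duplicate sequences never straddle train and test. A minimal sketch, assuming cluster IDs have already been produced by a tool such as LinClust (the helper name `cluster_split` and the greedy budget logic are illustrative, not the paper's code):

```python
import random
from collections import defaultdict

def cluster_split(cluster_ids, ratios=(0.70, 0.15, 0.15), seed=0):
    """Split sequence indices into train/val/test by whole clusters.

    cluster_ids[i] is the cluster label of sequence i. Keeping each
    cluster intact prevents leakage of near-identical sequences
    across splits; ratios approximate 70:15:15 at the sequence level.
    """
    clusters = defaultdict(list)
    for idx, cid in enumerate(cluster_ids):
        clusters[cid].append(idx)

    order = list(clusters)
    random.Random(seed).shuffle(order)  # randomize cluster assignment

    n = len(cluster_ids)
    train_budget = ratios[0] * n
    val_budget = (ratios[0] + ratios[1]) * n

    train, val, test = [], [], []
    assigned = 0
    for cid in order:
        members = clusters[cid]
        if assigned < train_budget:
            train.extend(members)
        elif assigned < val_budget:
            val.extend(members)
        else:
            test.extend(members)
        assigned += len(members)
    return train, val, test
```

Because clusters vary in size, the realized split can deviate slightly from the target ratio; that is the usual trade-off accepted in exchange for leakage-free partitions.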
Hardware Specification Yes All models are trained using 8 NVIDIA A100 GPUs, each with 80GB of GPU memory.
Software Dependencies No The paper mentions software like 'GPT-2', 'Mamba', 'Hyena' (model architectures), and 'AdamW optimizer', but does not specify version numbers for any key software components or libraries used in their implementation.
Experiment Setup Yes All models use 50M parameters, balancing performance and efficiency. We found that models of this scale can outperform larger existing models while maintaining reasonable run-times (see Appendix A.8). Detailed pre-training information is available in Appendix A.2.

Table 3: Hyperparameters for trained LM models

Hyperparameter         GPT-2   Mamba   Hyena
Number of layers       10      40      7
Hidden size            640     256     768
Intermediate size      2560    1024    3072
Batch size             1024    1024    1024
Learning rate (XE)     1e-3    1e-3    1e-4
Learning rate (HXE)    1e-4    1e-4    1e-4
Minimum learning rate  1e-5    1e-5    1e-6
Weight decay           0.1     0.1     0.1
Number of epochs       40      40      40
Vocabulary size        70      70      70
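For anyone re-implementing the setup, the per-backbone hyperparameters from Table 3 can be collected into a single configuration structure. The dictionary layout and key names below are hypothetical conveniences (the paper does not prescribe a config format); the values mirror the table:

```python
# Hyperparameters transcribed from Table 3 (XE = cross-entropy,
# HXE = hierarchical cross-entropy pre-training objective).
HPARAMS = {
    "gpt2":  {"layers": 10, "hidden": 640, "intermediate": 2560,
              "batch": 1024, "lr_xe": 1e-3, "lr_hxe": 1e-4,
              "min_lr": 1e-5, "weight_decay": 0.1,
              "epochs": 40, "vocab": 70},
    "mamba": {"layers": 40, "hidden": 256, "intermediate": 1024,
              "batch": 1024, "lr_xe": 1e-3, "lr_hxe": 1e-4,
              "min_lr": 1e-5, "weight_decay": 0.1,
              "epochs": 40, "vocab": 70},
    "hyena": {"layers": 7, "hidden": 768, "intermediate": 3072,
              "batch": 1024, "lr_xe": 1e-4, "lr_hxe": 1e-4,
              "min_lr": 1e-6, "weight_decay": 0.1,
              "epochs": 40, "vocab": 70},
}
```

Note that batch size, weight decay, epoch count, and vocabulary size are shared across all three backbones; only the architecture dimensions and learning rates differ.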