LC-PLM: Long-context Protein Language Modeling Using Bidirectional Mamba with Shared Projection Layers

Authors: Yingheng Wang, Zichen Wang, Gil Sadeh, Luca Zancato, Alessandro Achille, George Karypis, Huzefa Rangwala

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct experiments to evaluate the effectiveness of LC-PLM and LC-PLM-G and their building block, BiMamba-S. We address the following research questions: (RQ1) What is the scaling behavior of LC-PLM? How does it compare with its Transformer-based counterpart, ESM-2? ... (RQ7) Does LC-PLM-G improve protein function prediction and link prediction on the PPI graph? We provide the experimental setup, dataset descriptions, and task definitions in Appendices C and D.
Researcher Affiliation | Collaboration | Yingheng Wang, EMAIL, Department of Computer Science, Cornell University... Zichen Wang, Gil Sadeh, Luca Zancato, Alessandro Achille, George Karypis, Huzefa Rangwala... Corresponding author: EMAIL. Huzefa Rangwala is on leave of absence as a Professor of Computer Science at George Mason University. This paper describes work performed at Amazon.
Pseudocode | Yes | Appendix K (Pseudocode): We provide a detailed breakdown of our algorithm in this section and then summarize the computation procedure in a pseudocode block, shown in Algorithm 1.
Open Source Code | Yes | The model is available at github.com/amazon-science/LC-PLM.
Open Datasets | Yes | We train a long-context protein language model (LC-PLM) using bidirectional Mamba with shared projection layers (BiMamba-S) on protein sequences from UniRef50 with a masked language modeling (MLM) objective. Results show favorable neural scaling laws, length-extrapolation properties on UniRef90, and better downstream task performance on TAPE (Rao et al., 2019) and ProteinGym (Notin et al., 2024) than its Transformer counterpart, ESM-2. ... To evaluate whether LC-PLM-G encodes graph relational information, we first conduct graph-contextual protein language modeling on the PPI graph provided by the ogbn-proteins dataset (Hu et al., 2020).
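The MLM objective mentioned above can be sketched for protein sequences as follows. This is a minimal illustration, not the paper's implementation: the 15% mask rate and the 80/10/10 corruption split are the standard BERT-style recipe, assumed here, and the function and token names are ours.

```python
import random

# Standard 20 amino-acid alphabet; the mask token string is illustrative.
AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")
MASK_TOKEN = "<mask>"

def mask_sequence(seq, mask_rate=0.15, rng=None):
    """BERT-style MLM corruption: returns (corrupted tokens, labels).

    labels[i] is the original residue where a prediction loss is taken,
    and None where the position is left out of the loss.
    """
    rng = rng or random.Random(0)
    tokens, labels = [], []
    for aa in seq:
        if rng.random() < mask_rate:
            labels.append(aa)  # model must predict the original residue
            r = rng.random()
            if r < 0.8:
                tokens.append(MASK_TOKEN)               # 80%: mask token
            elif r < 0.9:
                tokens.append(rng.choice(AMINO_ACIDS))  # 10%: random residue
            else:
                tokens.append(aa)                       # 10%: unchanged
        else:
            tokens.append(aa)
            labels.append(None)
    return tokens, labels
```

The loss is then computed only at positions where `labels` is not None, which is what makes the objective bidirectional-friendly: the model sees the full (corrupted) sequence in both directions.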
Dataset Splits | Yes | We randomly sample 250,000 sequences from UniRef90 as the validation set to report evaluation losses for pretraining protein language models (pLMs). ... We split the UniRef90 sequences into 7 bins w.r.t. sequence length (i.e., 0-128, 128-256, 256-512, 512-1024, 1024-2048, 2048-4096, and 4096-8192). ... For the training set, we down-sample 1.5% of the protein chains used in OpenFold (Ahdritz et al., 2024), leading to 7,872 chains, with at most 1 protein chain from each cluster. ... We use a 95%/5% split for the training and validation sets. For held-out test sets, we use CASP15-multimers (52 protein complexes), CASP14 (37 protein structures), and Benchmark2 (17 heterodimer structures) (Ghani et al., 2021).
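The 7-bin length stratification described above can be sketched with a simple lookup. The bin edges restate the text; the helper name and the convention that an exact boundary length falls into the lower bin are our assumptions.

```python
import bisect

# Upper edges of the 7 length bins from the text:
# 0-128, 128-256, 256-512, 512-1024, 1024-2048, 2048-4096, 4096-8192.
BIN_EDGES = [128, 256, 512, 1024, 2048, 4096, 8192]

def length_bin(seq_len):
    """Return the 0-based bin index for a sequence of length seq_len.

    A length equal to a boundary (e.g. 128) is placed in the lower bin;
    lengths above 8192 are rejected.
    """
    idx = bisect.bisect_left(BIN_EDGES, seq_len)
    if idx >= len(BIN_EDGES):
        raise ValueError(f"length {seq_len} exceeds the largest bin (8192)")
    return idx
```

For example, a 5,000-residue chain lands in the 4096-8192 bin (index 6), which is the regime where length-extrapolation behavior is evaluated.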
Hardware Specification | Yes | All experiments are run on NVIDIA A100 Tensor Core GPUs, except ogbn-proteins and ogbl-ppa, which are run on NVIDIA A10G Tensor Core GPUs. ... We conducted an empirical inference-time comparison between LC-PLM (740M) and ESM-2 (650M) on an NVIDIA L40S GPU.
Software Dependencies | Yes | For the core software packages in the main experiments, we use Python 3.10, PyTorch 2.1.0, Transformers 4.41.2, DeepSpeed 0.14.4, Accelerate 0.27.2, mamba-ssm 2.2.0, datasets 2.20.0, Triton 2.0.0, and CUDA Toolkit 12.1. For some downstream tasks, the dependencies and package versions are adjusted accordingly. For ogbn-proteins and ogbl-ppa, we add several new packages: PyTorch Geometric 2.5.3, torch-cluster 1.6.3, torch-scatter 2.1.2, torch-sparse 0.6.18, torch-spline-conv 1.2.2, and OGB 1.3.6.
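For reproduction, the pinned versions above could be captured in a requirements file; the following is a sketch assembled from the stated versions (the PyPI package names are assumptions, e.g. `torch-geometric` for PyTorch Geometric, and Python/CUDA versions are environment-level, not pip-installable):

```
# Core packages (versions as stated in the report)
torch==2.1.0
transformers==4.41.2
deepspeed==0.14.4
accelerate==0.27.2
mamba-ssm==2.2.0
datasets==2.20.0
triton==2.0.0
# Additional packages for ogbn-proteins / ogbl-ppa
torch-geometric==2.5.3
torch-cluster==1.6.3
torch-scatter==2.1.2
torch-sparse==0.6.18
torch-spline-conv==1.2.2
ogb==1.3.6
```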
Experiment Setup | Yes | The training process uses the AdamW optimizer with a learning rate that is linearly warmed up for a small percentage of the total training steps, followed by a cosine decay schedule. The batch size and learning rate are chosen based on the model size and computational resources, with a total number of tokens of approximately 0.5M and learning rates set to 2e-4. Gradient clipping is often applied to stabilize training. Additionally, a weight decay of 0.1 is applied. ... We summarize the key hyperparameters in Table 7. (Table 7 lists: peak learning rate, global batch size, block size, warm-up steps, Adam betas, maximum gradient norm, precision, optimizer, learning rate scheduler, weight decay, length of random walks, number of walks, return parameter p, in-out parameter q, hidden size, and number of BiMamba-S blocks.)
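The schedule described above — linear warmup to a peak of 2e-4, then cosine decay — can be sketched as a per-step learning-rate function. The peak LR comes from the text; the warmup fraction and the decay-to-zero floor are illustrative assumptions (the paper's exact warm-up steps are in its Table 7).

```python
import math

def lr_at_step(step, total_steps, peak_lr=2e-4, warmup_frac=0.01, min_lr=0.0):
    """Learning rate at a given 0-based step: linear warmup, then cosine decay.

    warmup_frac and min_lr are illustrative defaults, not values from the paper.
    """
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        # Linear ramp from peak_lr / warmup_steps up to peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    # Cosine decay from peak_lr down to min_lr over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (peak_lr - min_lr) * cosine
```

In practice this would typically be wrapped in a framework scheduler (e.g. a PyTorch `LambdaLR` that multiplies a base LR by `lr_at_step(step, total) / peak_lr`), but the closed form above is the whole schedule.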