Structure Language Models for Protein Conformation Generation
Authors: Jiarui Lu, Xiaoyin Chen, Stephen Lu, Chence Shi, Hongyu Guo, Yoshua Bengio, Jian Tang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results across various conformation generation scenarios demonstrate the state-of-the-art performance of SLM including the representative ESMDiff model, achieving orders of magnitude faster speeds compared to existing generative methods. ... 5 EXPERIMENTS |
| Researcher Affiliation | Academia | 1Mila – Québec AI Institute, 2Université de Montréal, 3McGill University, 4University of Ottawa, 5National Research Council Canada, 6CIFAR AI Chair, 7HEC Montréal |
| Pseudocode | Yes | Algorithm 1 Inference: Conformation Generation of SLM... Algorithm 2 Masked diffusion fine-tuning of ESM3... Algorithm 3 Iterative Decoding with Positional Ranking... Algorithm 4 DDPM Ancestral Sampling for Conditional Masked Diffusion... Algorithm 5 Round-Trip Diffusion for Conformation Inpainting |
| Open Source Code | Yes | Code available at https://github.com/lujiarui/esmdiff. ... The source training and inference code for structure language models in this study are made publicly available at https://github.com/lujiarui/esmdiff |
| Open Datasets | Yes | The training data for structure language models are controlled to contain only PDB entries on or before May 1st, 2020. ... simulation dynamics of BPTI (Shaw et al., 2010), ... conformational changing pairs including the fold-switching (Chakravarty & Porter, 2022) and ligand-induced apo/holo states (Saldaño et al., 2022), and (3) intrinsically disordered proteins (IDPs) deposited in the Protein Ensemble Database (PED) (Lazar et al., 2021). ... ATLAS MD ensemble dataset (Vander Meersche et al., 2024) |
| Dataset Splits | No | The training data for structure language models are controlled to contain only PDB entries on or before May 1st, 2020. ... The training set is further filtered to include all monomeric structures with a max resolution of 5.0 Å, length ranging from 10 to 1000, which forms a total size of |D| = 112.4k as the training data. ... We curated the test set for intrinsically disordered proteins (IDPs) by downloading data from the Protein Ensemble Database (PED) (Lazar et al., 2021) on August 10, 2024 ... The final tested model and the reported epoch number in Table S1 come from the best checkpoint, selected according to the NLL of structure tokens on a hold-out validation set. |
| Hardware Specification | Yes | The profiling is carried out on a single NVIDIA A100 SXM4 GPU with 40GB memory ... Inference is very efficient, taking only 5.3 ± 0.3 seconds on a single NVIDIA A100-SXM4-40GB GPU |
| Software Dependencies | Yes | For MSA subsampling, we leverage the official repository of del Alamo et al. (2022) under AlphaFold v2.3.2 |
| Experiment Setup | Yes | Table S1 shows the hyperparameter settings for training. The total number of trainable parameters is 384M. The model is trained without a learning rate scheduler for up to 30 epochs. ... Table S2 shows the hyperparameter settings for training. ... The masked diffusion is scheduled with a log-linear noise schedule ... We use the AdamW optimizer with betas=(0.9, 0.999) and weight decay=0.01. The training is scheduled with the constant scheduler with 2,500 warm-up steps. ... we use a sampling schedule of 25 steps and set the sampling temperature to 1.0 unless otherwise specified. During sampling, we also adopted the nucleus (top-p) sampling strategy (Holtzman et al., 2019) to improve the quality of samples with a probability threshold of 0.95. |
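The sampling configuration reported above (temperature 1.0, nucleus sampling with top-p = 0.95) can be sketched as follows. This is a minimal, dependency-free illustration of standard nucleus sampling (Holtzman et al., 2019) with the paper's reported defaults; the function names are illustrative and are not taken from the authors' released code.

```python
import math
import random

def top_p_filter(probs, p=0.95):
    """Nucleus (top-p) filtering: keep the smallest set of tokens whose
    cumulative probability reaches p, then renormalize the kept mass."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

def sample_token(logits, temperature=1.0, p=0.95):
    """Temperature-scaled softmax followed by nucleus sampling, matching
    the defaults reported in the paper (temperature 1.0, top-p 0.95)."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                      # subtract max for numerical stability
    exps = [math.exp(l - m) for l in scaled]
    z = sum(exps)
    probs = [e / z for e in exps]
    filtered = top_p_filter(probs, p)
    # Sample from the renormalized truncated distribution.
    r, acc = random.random(), 0.0
    for i, q in filtered.items():
        acc += q
        if r <= acc:
            return i
    return next(reversed(filtered))      # guard against float round-off

# Example: a peaked 4-token distribution; with p=0.7 only the two most
# probable tokens survive filtering.
kept = top_p_filter([0.5, 0.3, 0.1, 0.1], p=0.7)
print(sorted(kept))        # token indices retained by the nucleus
token = sample_token([2.0, 1.0, 0.1, -1.0])
```

Note that each 25-step decoding pass of the masked diffusion model would apply this per-position sampling at every step; the sketch only shows the token-level operation.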