Structure Language Models for Protein Conformation Generation
Authors: Jiarui Lu, Xiaoyin Chen, Stephen Lu, Chence Shi, Hongyu Guo, Yoshua Bengio, Jian Tang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results across various conformation generation scenarios demonstrate the state-of-the-art performance of SLM including the representative ESMDiff model, achieving orders of magnitude faster speeds compared to existing generative methods. ... 5 EXPERIMENTS |
| Researcher Affiliation | Academia | 1Mila – Québec AI Institute, 2Université de Montréal, 3McGill University, 4University of Ottawa, 5National Research Council Canada, 6CIFAR AI Chair, 7HEC Montréal |
| Pseudocode | Yes | Algorithm 1 Inference: Conformation Generation of SLM... Algorithm 2 Masked diffusion fine-tuning of ESM3... Algorithm 3 Iterative Decoding with Positional Ranking... Algorithm 4 DDPM Ancestral Sampling for Conditional Masked Diffusion... Algorithm 5 Round-Trip Diffusion for Conformation Inpainting |
| Open Source Code | Yes | Code available at https://github.com/lujiarui/esmdiff. ... The source training and inference code for structure language models in this study are made publicly available at https://github.com/lujiarui/esmdiff |
| Open Datasets | Yes | The training data for structure language models are controlled to contain only PDB entries on or before May 1st, 2020. ... simulation dynamics of BPTI (Shaw et al., 2010), ... conformational changing pairs including the fold-switching (Chakravarty & Porter, 2022) and ligand-induced apo/holo states (Saldaño et al., 2022), and (3) intrinsically disordered proteins (IDPs) deposited in the Protein Ensemble Database (PED) (Lazar et al., 2021). ... ATLAS MD ensemble dataset (Vander Meersche et al., 2024) |
| Dataset Splits | No | The training data for structure language models are controlled to contain only PDB entries on or before May 1st, 2020. ... The training set is further filtered to include all monomeric structures with a max resolution of 5.0 Å, length ranging from 10 to 1000, which forms a total size of |D| = 112.4k as the training data. ... We curated the test set for intrinsically disordered proteins (IDPs) by downloading data from the Protein Ensemble Database (PED) (Lazar et al., 2021) on August 10, 2024 ... The final tested model and the reported epoch number in Table S1 come from the best checkpoint, selected according to the NLL of structure tokens on a hold-out validation set. |
| Hardware Specification | Yes | The profiling is carried out on a single NVIDIA A100 SXM4 GPU with 40GB memory ... Inference is very efficient, taking only 5.3 ± 0.3 seconds on a single NVIDIA A100-SXM4-40GB GPU |
| Software Dependencies | Yes | For MSA subsampling, we leverage the official repository of del Alamo et al. (2022) under AlphaFold v2.3.2 |
| Experiment Setup | Yes | Table S1 shows the hyperparameter settings for training. The total number of trainable parameters is 384M. The model is trained without a learning rate scheduler for up to 30 epochs. ... Table S2 shows the hyperparameter settings for training. ... The masked diffusion is scheduled with a log-linear noise schedule ... We use the AdamW optimizer with betas=(0.9, 0.999) and weight decay=0.01. The training is scheduled with the constant scheduler with 2,500 warm-up steps. ... we use a sampling schedule of 25 steps and set the sampling temperature to 1.0 unless otherwise specified. During sampling, we also adopted the nucleus (top-p) sampling strategy (Holtzman et al., 2019) to improve the quality of samples with a probability threshold of 0.95. |
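The sampling configuration reported above (temperature 1.0, nucleus sampling with top-p = 0.95) can be sketched as follows. This is a minimal, dependency-free illustration of standard nucleus sampling (Holtzman et al., 2019) with the paper's reported defaults; the function names are illustrative and are not taken from the authors' released code.

```python
import math
import random

def top_p_filter(probs, p=0.95):
    """Nucleus (top-p) filtering: keep the smallest set of tokens whose
    cumulative probability reaches p, then renormalize the kept mass."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

def sample_token(logits, temperature=1.0, p=0.95):
    """Temperature-scaled softmax followed by nucleus sampling, matching
    the defaults reported in the paper (temperature 1.0, top-p 0.95)."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                      # subtract max for numerical stability
    exps = [math.exp(l - m) for l in scaled]
    z = sum(exps)
    probs = [e / z for e in exps]
    filtered = top_p_filter(probs, p)
    # Sample from the renormalized truncated distribution.
    r, acc = random.random(), 0.0
    for i, q in filtered.items():
        acc += q
        if r <= acc:
            return i
    return next(reversed(filtered))      # guard against float round-off

# Example: a peaked 4-token distribution; with p=0.7 only the two most
# probable tokens survive filtering.
kept = top_p_filter([0.5, 0.3, 0.1, 0.1], p=0.7)
print(sorted(kept))        # token indices retained by the nucleus
token = sample_token([2.0, 1.0, 0.1, -1.0])
```

Note that each 25-step decoding pass of the masked diffusion model would apply this per-position sampling at every step; the sketch only shows the token-level operation.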