Distilling Structural Representations into Protein Sequence Models

Authors: Jeffrey Ouyang-Zhang, Chengyue Gong, Yue Zhao, Philipp Krähenbühl, Adam Klivans, Daniel Diaz

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | ISM outperforms state-of-the-art sequence models on several well-studied benchmarks, including mutation stability assessment and structure prediction. For example, on the CAMEO protein structure prediction benchmark, ISM outperforms its ESM2 counterpart with a GDT-TS score of 0.67 versus 0.64 (see Table 1). For S669 ΔΔG prediction, ISM surpasses ESM2 in AUC (0.76 vs. 0.72) and even matches specialized models. We ablate key design decisions by reporting long-range precision at L (P@L) for contact prediction, accuracy for secondary structure prediction, F1 for binding residue prediction, and Spearman correlation for ΔΔG prediction in Table 4.
Researcher Affiliation | Academia | Jeffrey Ouyang-Zhang, Chengyue Gong, Yue Zhao, Philipp Krähenbühl, Adam R. Klivans, Daniel J. Diaz; University of Texas at Austin; EMAIL
Pseudocode | No | The paper describes methods and processes in text and figures but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Our code is available at https://github.com/jozhang97/ISM.
Open Datasets | Yes | Our autoencoder training dataset contains 35K proteins from the Protein Data Bank (PDB). We extract per-residue microenvironment features for 5.8M proteins from Uniclust30 with AlphaFold structures (Mirdita et al., 2017), along with 35K PDB proteins. We evaluate how effectively ISM predicts the impact of single mutations on a protein's thermodynamic stability (ΔΔG) on the S669 dataset (Pancotti et al., 2022) in Table 2. We fine-tune on the cDNA117K dataset from Diaz et al. (2024), a subset of the cDNA display proteolysis dataset (Tsuboyama et al., 2023). We evaluate ISM on the PEER (Xu et al., 2022) and FLIP (Dallago et al., 2021) benchmarks.
Dataset Splits | Yes | For contact, secondary structure, and binding residue prediction, the proteins in the training and test sets have at most 30% sequence similarity. Contact, secondary structure, and binding residue prediction are evaluated using sequence similarity splits of 30%, 25%, and 20%, respectively.
Hardware Specification | Yes | Training takes 26 wall-clock hours on 32 GH200 GPUs.
Software Dependencies | No | The paper mentions specific optimizers (AdamW) but does not provide version numbers for any software libraries or dependencies used in the implementation.
Experiment Setup | Yes | We structure-tune the 650M-parameter ESM2 for 20 epochs using a cosine learning rate schedule with 4 warmup epochs. We use a total batch size of 1536 proteins cropped to a maximum sequence length of 512 amino acids. We use the AdamW optimizer with a learning rate of 1×10⁻⁴ and weight decay of 5×10⁻³.
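The reported training recipe (20 epochs, 4 warmup epochs, cosine schedule, base learning rate 1×10⁻⁴ for AdamW with weight decay 5×10⁻³) can be sketched as a learning-rate schedule in plain Python. This is a hedged reconstruction, not the authors' code: the exact warmup shape (linear is assumed here) and whether the schedule steps per epoch or per iteration are not specified in the excerpt.

```python
import math

# Assumed hyperparameters, taken from the reported setup.
BASE_LR = 1e-4        # AdamW learning rate
WARMUP_EPOCHS = 4
TOTAL_EPOCHS = 20
# (AdamW weight decay of 5e-3 would be passed to the optimizer separately.)

def learning_rate(epoch: int) -> float:
    """LR for a 0-indexed epoch: linear warmup, then cosine decay to ~0."""
    if epoch < WARMUP_EPOCHS:
        # Ramp linearly from BASE_LR/WARMUP_EPOCHS up to BASE_LR.
        return BASE_LR * (epoch + 1) / WARMUP_EPOCHS
    # Cosine decay over the remaining epochs.
    progress = (epoch - WARMUP_EPOCHS) / (TOTAL_EPOCHS - WARMUP_EPOCHS)
    return BASE_LR * 0.5 * (1.0 + math.cos(math.pi * progress))

schedule = [learning_rate(e) for e in range(TOTAL_EPOCHS)]
```

In a framework like PyTorch, the same shape could be handed to `torch.optim.lr_scheduler.LambdaLR` wrapped around `torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=5e-3)`.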