Dynamics-inspired Structure Hallucination for Protein-protein Interaction Modeling
Authors: Fang Wu, Stan Z. Li
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Comprehensive experiments on SKEMPI.v2 substantiate the superiority of Refine-PPI over all existing tools for predicting free energy change. These findings underscore the effectiveness of our hallucination strategy and the PDC module in addressing the absence of mutant protein structure and modeling geometric uncertainty. |
| Researcher Affiliation | Academia | Fang Wu (Department of Computer Science, Stanford University); Stan Z. Li (School of Engineering, Westlake University) |
| Pseudocode | Yes | The whole paradigm illustrated in pseudo-code is put in Algorithm 1. Algorithm 1 (the workflow of Refine-PPI) — Input: wild-type structure G^WT; mutant site and amino-acid types a_m and a'_m; backbone module h_ρ; refinement model f_θ; head predictor g_τ; number of recycles k; the real free energy change y; loss weight λ. Initialize structures: G^WT_0, G^MT_0 ← Equation 1(G^WT). Training-only: for t = 0, 1, ..., k−1: Z^WT_t ← h_ρ(G^WT_t); x^WT_{t+1} ← f_θ(G^WT_t, Z^WT_t, x^WT_t, a_m); end for. L_refine ← Equation 2(x^WT_k, x^WT) (the MMM loss). For t = 0, 1, ..., k−1 (no grad.): Z^MT_t ← h_ρ(G^MT_t); x^MT_{t+1} ← f_θ(G^MT_t, Z^MT_t, x^MT_t, a'_m); end for. Z^WT, Z^MT ← h_ρ(G^WT_k), h_ρ(G^MT_k). ŷ ← g_τ(Z^WT, Z^MT). L_ΔΔG ← RMSE(ŷ, y) (the ΔΔG loss). Backpropagation: update ρ, θ, τ with L_ΔΔG + λ·L_refine. |
| Open Source Code | No | The paper does not provide an explicit statement about releasing its own source code, nor does it provide a direct link to a code repository for the methodology described in this paper. Mentions of open-source repositories are related to baselines used for comparison. |
| Open Datasets | Yes | Data Evaluation is carried out on SKEMPI.v2 (Jankauskaitė et al., 2019). It contains data on changes in the thermodynamic parameters and kinetic rate constants after mutation for structurally resolved PPIs. The latest version contains manually curated binding data for 7,085 mutations. The dataset is split into 3 folds by structure, each containing unique protein complexes that do not appear in other folds. Two folds are used for training and validation, and the remaining fold is used for testing. This yields 3 different sets of parameters and ensures that every data point in SKEMPI.v2 is tested once. The pretraining data is derived from PDB-REDO, a database that contains refined X-ray structures in the PDB. |
| Dataset Splits | Yes | The dataset is split into 3 folds by structure, each containing unique protein complexes that do not appear in other folds. Two folds are used for training and validation, and the remaining fold is used for testing. This yields 3 different sets of parameters and ensures that every data point in SKEMPI.v2 is tested once. The pretraining data is derived from PDB-REDO, a database that contains refined X-ray structures in the PDB. The protein chains are clustered based on 50% sequence identity, leading to 38,413 chain clusters, which are randomly divided into the training, validation, and test sets in a 95%/0.5%/4.5% ratio, respectively. |
| Hardware Specification | Yes | We implement all experiments on 4 A100 GPUs, each with 80 GB of memory. |
| Software Dependencies | No | The paper mentions optimizers (Adam) and schedulers (ReduceLROnPlateau) but does not provide specific version numbers for these or other software libraries (e.g., PyTorch, TensorFlow, Python version) used for the implementation of Refine-PPI. Version numbers are only provided for baseline tools (Rosetta, Jackhmmer). |
| Experiment Setup | Yes | Refine-PPI is trained with an Adam optimizer without weight decay and with β1 = 0.9 and β2 = 0.999. A ReduceLROnPlateau scheduler is employed to automatically adjust the learning rate, with a patience of 10 epochs and a minimum learning rate of 1e−6. The batch size is set to 64 and the initial learning rate to 1e−4. The maximum number of iterations is 50K and the validation frequency is 1K iterations. The node dimension is 128, and no dropout is used. As for the structure refinement, the recycle number is set to 3, and the balance weight is tuned to 1.0. We performed a grid search to find the optimal length of the masked region and found that l = r = 5 is a good choice. However, different initializations require different optimal hyperparameters, and typically we can mask longer regions for denoising-based MMM. The pretraining follows a similar training scheme with a batch size of 32. |
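The recycled wild-type/mutant workflow of Algorithm 1 can be sketched as a toy forward pass. This is a minimal NumPy sketch, not the paper's implementation: the module shapes, the `tanh` encoder, and the mean-pooled embeddings are all illustrative assumptions, and since NumPy has no autograd, the "no grad." treatment of the mutant recycles is noted only in comments.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the paper's modules (shapes are illustrative assumptions)
W_rho = rng.normal(size=(3, 8))    # h_rho: coordinates -> residue embedding
W_theta = rng.normal(size=(8, 3))  # f_theta: embedding -> refined coordinates
W_tau = rng.normal(size=(16, 1))   # g_tau: concat(wt, mt) embedding -> ddG

def encode(x):            # plays the role of h_rho
    return np.tanh(x @ W_rho)

def refine(z):            # plays the role of f_theta
    return z @ W_theta

def refine_ppi_forward(x_wt, x_mt, x_wt_true, y, k=3, lam=1.0):
    """Forward pass mirroring Algorithm 1. In the real model the mutant
    recycles run under no-grad; NumPy has no autograd, so both loops are
    plain computation here."""
    x = x_wt
    for _ in range(k):                            # wild-type recycles
        x = refine(encode(x))
    loss_refine = np.mean((x - x_wt_true) ** 2)   # MMM refinement loss (Eq. 2)

    xm = x_mt
    for _ in range(k):                            # mutant recycles (detached in the paper)
        xm = refine(encode(xm))

    z_wt, z_mt = encode(x).mean(axis=0), encode(xm).mean(axis=0)
    y_hat = float(np.concatenate([z_wt, z_mt]) @ W_tau)
    loss_ddg = abs(y_hat - y)     # RMSE reduces to |error| for a single sample
    return loss_ddg + lam * loss_refine

x_wt, x_mt, x_true = (rng.normal(size=(5, 3)) for _ in range(3))
loss = refine_ppi_forward(x_wt, x_mt, x_true, y=0.5)
```

In a real training step only the wild-type branch and the two loss terms would contribute gradients, matching the λ-weighted objective L_ΔΔG + λ·L_refine from the algorithm.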
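The structure-based 3-fold split described in the Dataset Splits row (no complex shared across folds, every point tested once) can be sketched as follows; the function name and the round-robin fold assignment are assumptions for illustration, not the paper's exact procedure.

```python
import random

def three_fold_by_structure(complex_ids, seed=0):
    """Hypothetical sketch of a structure-based 3-fold split: group data
    by protein complex so no complex appears in more than one fold."""
    ids = sorted(set(complex_ids))
    random.Random(seed).shuffle(ids)
    folds = [set(ids[i::3]) for i in range(3)]   # disjoint by construction
    # Each fold serves once as the test set; the other two are train/val.
    splits = []
    for t in range(3):
        test = folds[t]
        trainval = set().union(*(folds[i] for i in range(3) if i != t))
        splits.append((trainval, test))
    return splits
```

Because folds are built from unique complex identifiers rather than individual mutations, structurally related data points cannot leak between training and test, and rotating the test fold guarantees each complex is evaluated exactly once.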
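The learning-rate schedule in the Experiment Setup row (ReduceLROnPlateau, patience 10, minimum learning rate 1e−6, initial rate 1e−4) can be illustrated with a minimal pure-Python re-implementation of the plateau logic; the reduction `factor` of 0.5 is an assumption, since the excerpt does not state it.

```python
class ReduceLROnPlateauLite:
    """Minimal sketch of ReduceLROnPlateau-style scheduling: halve the
    learning rate when the validation loss stops improving for more than
    `patience` epochs, never going below `min_lr`."""
    def __init__(self, lr=1e-4, factor=0.5, patience=10, min_lr=1e-6):
        self.lr, self.factor = lr, factor
        self.patience, self.min_lr = patience, min_lr
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        if val_loss < self.best:          # improvement: reset the counter
            self.best = val_loss
            self.bad_epochs = 0
        else:                             # plateau: count stalled epochs
            self.bad_epochs += 1
            if self.bad_epochs > self.patience:
                self.lr = max(self.lr * self.factor, self.min_lr)
                self.bad_epochs = 0
        return self.lr
```

With a flat validation loss, the rate decays stepwise from 1e−4 and clamps at the 1e−6 floor, matching the behavior the paper relies on to adjust the learning rate automatically.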