PPDiff: Diffusing in Hybrid Sequence-Structure Space for Protein-Protein Complex Design
Authors: Zhenqiao Song, Tianxiao Li, Lei Li, Martin Renqiang Min
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we first describe our PPBench construction process in Section 4.1 and experimental setup in Section 4.2. Then we conduct extensive experiments on a General Protein-Protein Complex Design task (Section 4.3) and two real-world applications, Target-Protein Mini-Binder Complex Design (Section 4.4) and Antigen-Antibody Complex Design (Section 4.5). |
| Researcher Affiliation | Collaboration | ¹Language Technologies Institute, Carnegie Mellon University; ²NEC Laboratories America. |
| Pseudocode | No | The paper describes the architecture and processes using mathematical formulations and descriptive text, but it does not include a clearly labeled pseudocode or algorithm block. |
| Open Source Code | No | The paper states: "The top-1 candidates are uploaded in the supplementary material." This refers to data, not the source code for the methodology. There is no explicit statement about releasing the code for PPDiff or any associated repository link. |
| Open Datasets | Yes | To assess PPDiff, we curate PPBench, a general protein-protein complex dataset comprising 706,360 complexes from the Protein Data Bank (PDB). The model is pretrained on PPBench and finetuned on two real-world applications: target-protein mini-binder complex design and antigen-antibody complex design. Following Jin et al. (2022), we curate antigen-antibody complexes from the Structural Antibody Database (SAbDab; Dunbar et al., 2014). |
| Dataset Splits | Yes | To prepare the dataset for training and evaluation, we designate 10 clusters each for validation and testing, with the remaining clusters reserved for training. To ensure efficient processing, training data are further filtered to exclude sequences longer than 1,024 residues, while validation and testing data are restricted to sequences no longer than 512 residues, finally resulting in a total of 706,360 complexes. ... For categories containing more than 50 complexes, an 8:1:1 random split is applied for training, validation, and test sets. ... The dataset is split into training, validation, and test sets based on the clustering of CDRs... Clusters are then randomly divided into training, validation, and test sets in an 8:1:1 ratio. |
| Hardware Specification | Yes | PPDiff is trained for 1,000,000 steps on a single NVIDIA RTX A6000 GPU using the Adam optimizer (Kingma, 2014). |
| Software Dependencies | No | The paper mentions using "the pretrained 650M ESM-2 model (Lin et al., 2022)" and "Adam optimizer (Kingma, 2014)", but it does not provide specific version numbers for software dependencies like programming languages, libraries, or frameworks (e.g., Python, PyTorch, TensorFlow, CUDA). |
| Experiment Setup | Yes | The embedding dimensionality is configured at 1,280. PPDiff is trained for 1,000,000 steps on a single NVIDIA RTX A6000 GPU using the Adam optimizer (Kingma, 2014). The batch size is set to 1,024 tokens, and the learning rate is initialized at 5e-6. For structure diffusion, the starting and ending values of β are set to 1e-7 and 2e-3, respectively, with a variance schedule of 2. The cosine schedule offset for sequence diffusion is configured at 0.01. The number of k-nearest neighbors is fixed at 32. Additionally, a learning rate warm-up is applied over the first 4,000 steps to stabilize the training process. |
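The schedules quoted in the Experiment Setup row can be sketched as below. This is a minimal illustration only: the function names, the use of a linear β schedule for structure diffusion, the standard cosine noise schedule (with the stated 0.01 offset) for sequence diffusion, and linear warm-up are all assumptions for the sketch, not details taken from the authors' code.

```python
import math

# Hedged sketch of the reported training schedules. All formulas here are
# assumptions chosen to match the quoted hyperparameters, not the paper's code.

def linear_beta_schedule(num_steps, beta_start=1e-7, beta_end=2e-3):
    """Linear variance schedule between the reported start/end β values
    (structure diffusion)."""
    return [beta_start + (beta_end - beta_start) * t / (num_steps - 1)
            for t in range(num_steps)]

def cosine_alpha_bar(t, num_steps, offset=0.01):
    """Standard cosine noise schedule with the reported offset 0.01
    (sequence diffusion); returns the cumulative signal level at step t."""
    return math.cos((t / num_steps + offset) / (1 + offset) * math.pi / 2) ** 2

def warmup_lr(step, base_lr=5e-6, warmup_steps=4000):
    """Linear learning-rate warm-up over the first 4,000 steps
    toward the reported initial rate of 5e-6."""
    return base_lr * min(1.0, step / warmup_steps)
```

For example, `warmup_lr(2000)` returns half the base rate, and `cosine_alpha_bar` decays from near 1.0 at step 0 to 0 at the final step, matching the usual behavior of these schedules.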