PPDiff: Diffusing in Hybrid Sequence-Structure Space for Protein-Protein Complex Design

Authors: Zhenqiao Song, Tianxiao Li, Lei Li, Martin Renqiang Min

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we first describe our PPBench construction process in Section 4.1 and experimental setup in Section 4.2. Then we conduct extensive experiments on a General Protein-Protein Complex Design task (Section 4.3) and two real-world applications, Target-Protein Mini-Binder Complex Design (Section 4.4) and Antigen-Antibody Complex Design (Section 4.5).
Researcher Affiliation | Collaboration | Language Technologies Institute, Carnegie Mellon University; NEC Laboratories America.
Pseudocode | No | The paper describes the architecture and processes using mathematical formulations and descriptive text, but it does not include a clearly labeled pseudocode or algorithm block.
Open Source Code | No | The paper states: "The top-1 candidates are uploaded in the supplementary material." This refers to data, not the source code for the methodology. There is no explicit statement about releasing the code for PPDiff, nor any associated repository link.
Open Datasets | Yes | To assess PPDiff, we curate PPBench, a general protein-protein complex dataset comprising 706,360 complexes from the Protein Data Bank (PDB). The model is pretrained on PPBench and finetuned on two real-world applications: target-protein mini-binder complex design and antigen-antibody complex design. Following Jin et al. (2022), we curate antigen-antibody complexes from the Structural Antibody Database (SAbDab; Dunbar et al., 2014).
Dataset Splits | Yes | "To prepare the dataset for training and evaluation, we designate 10 clusters each for validation and testing, with the remaining clusters reserved for training. To ensure efficient processing, training data are further filtered to exclude sequences longer than 1,024 residues, while validation and testing data are restricted to sequences no longer than 512 residues, finally resulting in a total of 706,360 complexes." "For categories containing more than 50 complexes, an 8:1:1 random split is applied for training, validation, and test sets." "The dataset is split into training, validation, and test sets based on the clustering of CDRs... Clusters are then randomly divided into training, validation, and test sets in an 8:1:1 ratio."
Hardware Specification | Yes | PPDiff is trained for 1,000,000 steps on a single NVIDIA RTX A6000 GPU using the Adam optimizer (Kingma, 2014).
Software Dependencies | No | The paper mentions using "the pretrained 650M ESM-2 model (Lin et al., 2022)" and the "Adam optimizer (Kingma, 2014)", but it does not provide specific version numbers for software dependencies such as programming languages, libraries, or frameworks (e.g., Python, PyTorch, TensorFlow, CUDA).
Experiment Setup | Yes | The embedding dimensionality is configured at 1,280. PPDiff is trained for 1,000,000 steps on a single NVIDIA RTX A6000 GPU using the Adam optimizer (Kingma, 2014). The batch size is set to 1,024 tokens, and the learning rate is initialized at 5e-6. For structure diffusion, the starting and ending values of β are set to 1e-7 and 2e-3, respectively, with a variance schedule of 2. The cosine schedule offset for sequence diffusion is configured at 0.01. The number of k-nearest neighbors is fixed at 32. Additionally, a learning rate warm-up is applied over the first 4,000 steps to stabilize the training process.
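The cluster-level 8:1:1 split quoted in the Dataset Splits row can be sketched as follows. This is an illustrative reconstruction, not the authors' code (which is not released): the function name, the fixed seed, and the use of Python's `random` module are all assumptions. The key point is that shuffling and splitting happen at the cluster level, so similar sequences never straddle the train/validation/test boundary.

```python
import random

def split_clusters(cluster_ids, ratios=(0.8, 0.1, 0.1), seed=0):
    """Randomly split cluster IDs (not individual complexes) into
    train/validation/test subsets in the given ratios."""
    ids = list(cluster_ids)
    random.Random(seed).shuffle(ids)  # deterministic shuffle for reproducibility
    n_train = int(len(ids) * ratios[0])
    n_valid = int(len(ids) * ratios[1])
    return (ids[:n_train],
            ids[n_train:n_train + n_valid],
            ids[n_train + n_valid:])

# 100 hypothetical cluster IDs -> 80/10/10 clusters
train, valid, test = split_clusters([f"cluster_{i}" for i in range(100)])
```

Per-sequence length filters (≤1,024 residues for training, ≤512 for validation/test) would then be applied within each subset, per the quoted description.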