ProtComposer: Compositional Protein Structure Generation with 3D Ellipsoids

Authors: Hannes Stärk, Bowen Jing, Tomas Geffner, Jason Yim, Tommi Jaakkola, Arash Vahdat, Karsten Kreis

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental 4 EXPERIMENTS Starting from the publicly available pre-trained checkpoint, we fine-tune Multiflow (Campbell et al., 2024) on the dataset and splits supplied by the authors, where ellipsoid spatial layouts are obtained for each protein as described in Section 3.1. We train only on the joint unconditional modeling task (i.e., no motif scaffolding, inverse folding, or forward folding). At inference time, we employ self-conditioning, rotational annealing, and 500 inference steps as described in Campbell et al. (2024). Throughout our experiments, we consider three sources of ellipsoid layouts: PDB proteins from the Multiflow validation set (data ellipsoids), ellipsoids drawn from our statistical model (Section 3.4; synthetic ellipsoids), and manually specified ellipsoids. The key feature of data ellipsoids is that they are associated with ground-truth proteins, providing an oracle generator for ellipsoid adherence. When using data ellipsoids, we sample proteins of equal length to the ground-truth proteins, while for novel ellipsoids, the protein length is the sum of ellipsoid residue counts, Σ_k n_k. Summary statistics about both sources of ellipsoids are described in Appendix A.
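The two length-setting rules quoted above can be sketched as follows. This is an illustrative helper, not code from the paper; the function name and signature are assumptions.

```python
# Hypothetical sketch of how the generated protein length is chosen:
# data ellipsoids inherit the ground-truth protein's length, while novel
# (synthetic or manually specified) ellipsoids use the sum of per-ellipsoid
# residue counts, sum_k n_k.

def target_length(ellipsoid_residue_counts, ground_truth_length=None):
    """Return the protein length to sample.

    ellipsoid_residue_counts: list of n_k, one per ellipsoid.
    ground_truth_length: set only for data ellipsoids (oracle case).
    """
    if ground_truth_length is not None:
        return ground_truth_length          # data ellipsoids
    return sum(ellipsoid_residue_counts)    # novel ellipsoids: sum_k n_k

print(target_length([40, 25, 60]))        # novel ellipsoids -> 125
print(target_length([40, 25, 60], 128))   # data ellipsoids  -> 128
```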
Researcher Affiliation Collaboration ¹NVIDIA, ²CSAIL, Massachusetts Institute of Technology
Pseudocode Yes Algorithm 1: Invariant Cross Attention Algorithm 2: Update Block Algorithm 3: Residue Segmentation and Gaussian Fitting
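The "Gaussian Fitting" half of Algorithm 3 can be sketched with standard linear algebra: fit a 3D Gaussian to a segment's residue coordinates and read the ellipsoid off the covariance's eigendecomposition. This is a minimal sketch under that assumption; the function name, `scale` contour factor, and return convention are not from the paper.

```python
import numpy as np

def fit_ellipsoid(coords, scale=2.0):
    """Fit a 3D Gaussian to one segment's residue coordinates.

    coords: (n, 3) array of residue positions for a segment.
    Returns (center, semi_axis_lengths, axis_directions), where the
    semi-axes are `scale` standard deviations along each principal axis.
    """
    center = coords.mean(axis=0)
    cov = np.cov(coords, rowvar=False)        # 3x3 sample covariance
    eigvals, eigvecs = np.linalg.eigh(cov)    # eigenvalues in ascending order
    semi_axes = scale * np.sqrt(np.clip(eigvals, 0.0, None))
    return center, semi_axes, eigvecs

# Synthetic elongated point cloud: largest semi-axis should track the
# direction with the largest spread (sigma = 5 here).
rng = np.random.default_rng(0)
pts = rng.normal(size=(200, 3)) * np.array([5.0, 2.0, 1.0])
center, semi_axes, dirs = fit_ellipsoid(pts)
print(semi_axes)
```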
Open Source Code Yes All code for this paper is available at https://github.com/NVlabs/protcomposer
Open Datasets Yes The training data consists of PDB proteins and synthetic data. The PDB training set collected by Yim et al. (2023b) consists of 18684 proteins of length 60-384.
Dataset Splits Yes Starting from the publicly available pre-trained checkpoint, we fine-tune Multiflow (Campbell et al., 2024) on the dataset and splits supplied by the authors
Hardware Specification Yes Training is carried out on 8 NVIDIA A100 GPUs for 20 hours, corresponding to 83 epochs.
Software Dependencies No The paper mentions several software components, such as Multiflow (Campbell et al., 2024), Protein MPNN (Dauparas et al., 2022), ESMFold (Lin et al., 2023), and Chroma (Ingraham et al., 2023), but does not provide specific version numbers for these or other ancillary software components.
Experiment Setup Yes Starting from the publicly available pre-trained checkpoint, we fine-tune Multiflow (Campbell et al., 2024) ... and use their optimizers, data filtering, losses, and hyperparameters (AdamW, learning rate 0.0001). ... At inference time, we employ self-conditioning, rotational annealing, and 500 inference steps as described in Campbell et al. (2024). For guidance, we use the pretrained Multiflow checkpoint as the unconditional model. ... We run Multiflow at the rotational annealing strengths [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.2, 1.4, 1.6, 2, 3, 4, 5, 6, 10] ... For Chroma (Ingraham et al., 2023) ... The inverse-sampling temperatures that we sweep over are [1, 1.4, 2, 4, 8, 10, 15, 20, 40, 80] ... For running RFdiffusion (Watson et al., 2023) ... The sampling temperatures that we sweep over are [0, 0.2, 0.4, 0.6, 0.8, 1, 1.2, 1.4, 1.8, 2, 2.4, 2.8, 3.2, 4, 4.8, 6]
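The sweeps quoted above amount to running each baseline once per grid value and collecting metrics per setting. A minimal sketch, where `run_model` is a hypothetical stand-in for an actual inference run:

```python
# Grids taken verbatim from the quoted experiment setup; the sweep driver
# itself is illustrative, not the authors' code.
MULTIFLOW_ANNEAL = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9,
                    1.0, 1.1, 1.2, 1.4, 1.6, 2, 3, 4, 5, 6, 10]
CHROMA_TEMPS = [1, 1.4, 2, 4, 8, 10, 15, 20, 40, 80]
RFDIFFUSION_TEMPS = [0, 0.2, 0.4, 0.6, 0.8, 1, 1.2, 1.4,
                     1.8, 2, 2.4, 2.8, 3.2, 4, 4.8, 6]

def sweep(settings, run_model):
    """Run the model once per setting; return {setting: metrics}."""
    return {s: run_model(s) for s in settings}

# Placeholder run: in practice run_model would generate proteins and
# compute designability / ellipsoid-adherence metrics.
results = sweep(MULTIFLOW_ANNEAL, lambda s: {"anneal_strength": s})
print(len(results))  # 21 annealing settings
```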