ProtComposer: Compositional Protein Structure Generation with 3D Ellipsoids
Authors: Hannes Stärk, Bowen Jing, Tomas Geffner, Jason Yim, Tommi Jaakkola, Arash Vahdat, Karsten Kreis
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 4 EXPERIMENTS Starting from the publicly available pre-trained checkpoint, we fine-tune Multiflow (Campbell et al., 2024) on the dataset and splits supplied by the authors, where ellipsoid spatial layouts are obtained for each protein as described in Section 3.1. We train only on the joint unconditional modeling task (i.e., no motif scaffolding, inverse folding, or forward folding). At inference time, we employ self-conditioning, rotational annealing, and 500 inference steps as described in Campbell et al. (2024). Throughout our experiments, we consider three sources of ellipsoid layouts: PDB proteins from the Multiflow validation set (data ellipsoids), ellipsoids drawn from our statistical model (Section 3.4; synthetic ellipsoids), and manually specified ellipsoids. The key feature of data ellipsoids is that they are associated with ground-truth proteins, providing an oracle generator for ellipsoid adherence. When using data ellipsoids, we sample proteins of the same lengths as the ground-truth proteins, while for novel ellipsoids, the protein length is the sum of the ellipsoid residue counts, Σₖ nₖ. Summary statistics for both sources of ellipsoids are given in Appendix A. |
| Researcher Affiliation | Collaboration | 1NVIDIA, 2CSAIL, Massachusetts Institute of Technology |
| Pseudocode | Yes | Algorithm 1: Invariant Cross Attention Algorithm 2: Update Block Algorithm 3: Residue Segmentation and Gaussian Fitting |
| Open Source Code | Yes | All code for this paper is available at https://github.com/NVlabs/protcomposer |
| Open Datasets | Yes | The training data consists of PDB proteins and synthetic data. The PDB training set collected by Yim et al. (2023b) consists of 18684 proteins of length 60-384. |
| Dataset Splits | Yes | Starting from the publicly available pre-trained checkpoint, we fine-tune Multiflow (Campbell et al., 2024) on the dataset and splits supplied by the authors |
| Hardware Specification | Yes | Training is carried out on 8 NVIDIA A100 GPUs for 20 hours, corresponding to 83 epochs. |
| Software Dependencies | No | The paper mentions several software components, such as Multiflow (Campbell et al., 2024), ProteinMPNN (Dauparas et al., 2022), ESMFold (Lin et al., 2023), and Chroma (Ingraham et al., 2023), but does not provide specific version numbers for these or other ancillary software components. |
| Experiment Setup | Yes | Starting from the publicly available pre-trained checkpoint, we fine-tune Multiflow (Campbell et al., 2024) ... and use their optimizers, data filtering, losses, and hyperparameters (AdamW, learning rate 0.0001). ... At inference time, we employ self-conditioning, rotational annealing, and 500 inference steps as described in Campbell et al. (2024). For guidance, we use the pretrained Multiflow checkpoint as the unconditional model. ... We run Multiflow at the rotational annealing strengths [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.2, 1.4, 1.6, 2, 3, 4, 5, 6, 10] ... For Chroma (Ingraham et al., 2023) ... The inverse-sampling temperatures that we sweep over are [1, 1.4, 2, 4, 8, 10, 15, 20, 40, 80] ... For running RFdiffusion (Watson et al., 2023) ... The sampling temperatures that we sweep over are [0, 0.2, 0.4, 0.6, 0.8, 1, 1.2, 1.4, 1.8, 2, 2.4, 2.8, 3.2, 4, 4.8, 6] |
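The paper's Algorithm 3 ("Residue Segmentation and Gaussian Fitting") fits a 3D Gaussian, i.e. an ellipsoid, to each group of residues. The following is a minimal sketch of only the Gaussian-fitting step, under the assumption that an ellipsoid is parameterized by the mean and covariance of a segment's Cα coordinates; the segmentation step, any scaling conventions, and the function name `fit_ellipsoid` are not taken from the paper:

```python
def fit_ellipsoid(coords):
    """Fit a 3D Gaussian (center + covariance) to a list of (x, y, z)
    Ca coordinates. The covariance's eigenvectors and eigenvalues
    define the ellipsoid's principal axes and extents."""
    n = len(coords)
    # Center: per-axis mean of the coordinates.
    center = tuple(sum(p[i] for p in coords) / n for i in range(3))
    # Covariance: 3x3 matrix of centered second moments.
    cov = [
        [sum((p[i] - center[i]) * (p[j] - center[j]) for p in coords) / n
         for j in range(3)]
        for i in range(3)
    ]
    return center, cov

# Toy example: four residues spread mostly along the x-axis,
# so the fitted ellipsoid is elongated in x.
points = [(0.0, 0.0, 0.0), (2.0, 0.1, 0.0), (4.0, -0.1, 0.0), (6.0, 0.0, 0.0)]
center, cov = fit_ellipsoid(points)
```

For novel layouts, the number of residues assigned to each such ellipsoid (nₖ in the paper's notation) is summed to obtain the total protein length.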