ProtComposer: Compositional Protein Structure Generation with 3D Ellipsoids

Authors: Hannes Stärk, Bowen Jing, Tomas Geffner, Jason Yim, Tommi Jaakkola, Arash Vahdat, Karsten Kreis

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental 4 EXPERIMENTS Starting from the publicly available pre-trained checkpoint, we fine-tune Multiflow (Campbell et al., 2024) on the dataset and splits supplied by the authors, where ellipsoid spatial layouts are obtained for each protein as described in Section 3.1. We train only on the joint unconditional modeling task (i.e., no motif scaffolding, inverse folding, or forward folding). At inference time, we employ self-conditioning, rotational annealing, and 500 inference steps as described in Campbell et al. (2024). Throughout our experiments, we consider three sources of ellipsoid layouts: PDB proteins from the Multiflow validation set (data ellipsoids), ellipsoids drawn from our statistical model (Section 3.4; synthetic ellipsoids), and manually specified ellipsoids. The key feature of data ellipsoids is that they are associated with ground-truth proteins, providing an oracle generator for ellipsoid adherence. When using data ellipsoids, we sample proteins of equal length to the ground-truth proteins, while for novel ellipsoids, the protein length is the sum of ellipsoid residue counts, Σ_k n_k. Summary statistics about both sources of ellipsoids are described in Appendix A.
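The two length-setting rules quoted above can be sketched as follows. This is an illustrative helper, not code from the paper; the function name and signature are assumptions.

```python
# Hypothetical sketch of how the generated protein length is chosen:
# data ellipsoids inherit the ground-truth protein's length, while novel
# (synthetic or manually specified) ellipsoids use the sum of per-ellipsoid
# residue counts, sum_k n_k.

def target_length(ellipsoid_residue_counts, ground_truth_length=None):
    """Return the protein length to sample.

    ellipsoid_residue_counts: list of n_k, one per ellipsoid.
    ground_truth_length: set only for data ellipsoids (oracle case).
    """
    if ground_truth_length is not None:
        return ground_truth_length          # data ellipsoids
    return sum(ellipsoid_residue_counts)    # novel ellipsoids: sum_k n_k

print(target_length([40, 25, 60]))        # novel ellipsoids -> 125
print(target_length([40, 25, 60], 128))   # data ellipsoids  -> 128
```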
Researcher Affiliation Collaboration ¹NVIDIA, ²CSAIL, Massachusetts Institute of Technology
Pseudocode Yes Algorithm 1: Invariant Cross Attention Algorithm 2: Update Block Algorithm 3: Residue Segmentation and Gaussian Fitting
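The "Gaussian Fitting" half of Algorithm 3 can be sketched with standard linear algebra: fit a 3D Gaussian to a segment's residue coordinates and read the ellipsoid off the covariance's eigendecomposition. This is a minimal sketch under that assumption; the function name, `scale` contour factor, and return convention are not from the paper.

```python
import numpy as np

def fit_ellipsoid(coords, scale=2.0):
    """Fit a 3D Gaussian to one segment's residue coordinates.

    coords: (n, 3) array of residue positions for a segment.
    Returns (center, semi_axis_lengths, axis_directions), where the
    semi-axes are `scale` standard deviations along each principal axis.
    """
    center = coords.mean(axis=0)
    cov = np.cov(coords, rowvar=False)        # 3x3 sample covariance
    eigvals, eigvecs = np.linalg.eigh(cov)    # eigenvalues in ascending order
    semi_axes = scale * np.sqrt(np.clip(eigvals, 0.0, None))
    return center, semi_axes, eigvecs

# Synthetic elongated point cloud: largest semi-axis should track the
# direction with the largest spread (sigma = 5 here).
rng = np.random.default_rng(0)
pts = rng.normal(size=(200, 3)) * np.array([5.0, 2.0, 1.0])
center, semi_axes, dirs = fit_ellipsoid(pts)
print(semi_axes)
```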
Open Source Code Yes All code for this paper is available at https://github.com/NVlabs/protcomposer
Open Datasets Yes The training data consists of PDB proteins and synthetic data. The PDB training set collected by Yim et al. (2023b) consists of 18684 proteins of length 60-384.
Dataset Splits Yes Starting from the publicly available pre-trained checkpoint, we fine-tune Multiflow (Campbell et al., 2024) on the dataset and splits supplied by the authors
Hardware Specification Yes Training is carried out on 8 NVIDIA A100 GPUs for 20 hours, corresponding to 83 epochs.
Software Dependencies No The paper mentions several software components, such as Multiflow (Campbell et al., 2024), Protein MPNN (Dauparas et al., 2022), ESMFold (Lin et al., 2023), and Chroma (Ingraham et al., 2023), but does not provide specific version numbers for these or other ancillary software components.
Experiment Setup Yes Starting from the publicly available pre-trained checkpoint, we fine-tune Multiflow (Campbell et al., 2024) ... and use their optimizers, data filtering, losses, and hyperparameters (AdamW, learning rate 0.0001). ... At inference time, we employ self-conditioning, rotational annealing, and 500 inference steps as described in Campbell et al. (2024). For guidance, we use the pretrained Multiflow checkpoint as the unconditional model. ... We run Multiflow at the rotational annealing strengths [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.2, 1.4, 1.6, 2, 3, 4, 5, 6, 10] ... For Chroma (Ingraham et al., 2023) ... The inverse-sampling temperatures that we sweep over are [1, 1.4, 2, 4, 8, 10, 15, 20, 40, 80] ... For running RFdiffusion (Watson et al., 2023) ... The sampling temperatures that we sweep over are [0, 0.2, 0.4, 0.6, 0.8, 1, 1.2, 1.4, 1.8, 2, 2.4, 2.8, 3.2, 4, 4.8, 6]
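The sweeps quoted above amount to running each baseline once per grid value and collecting metrics per setting. A minimal sketch, where `run_model` is a hypothetical stand-in for an actual inference run:

```python
# Grids taken verbatim from the quoted experiment setup; the sweep driver
# itself is illustrative, not the authors' code.
MULTIFLOW_ANNEAL = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9,
                    1.0, 1.1, 1.2, 1.4, 1.6, 2, 3, 4, 5, 6, 10]
CHROMA_TEMPS = [1, 1.4, 2, 4, 8, 10, 15, 20, 40, 80]
RFDIFFUSION_TEMPS = [0, 0.2, 0.4, 0.6, 0.8, 1, 1.2, 1.4,
                     1.8, 2, 2.4, 2.8, 3.2, 4, 4.8, 6]

def sweep(settings, run_model):
    """Run the model once per setting; return {setting: metrics}."""
    return {s: run_model(s) for s in settings}

# Placeholder run: in practice run_model would generate proteins and
# compute designability / ellipsoid-adherence metrics.
results = sweep(MULTIFLOW_ANNEAL, lambda s: {"anneal_strength": s})
print(len(results))  # 21 annealing settings
```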