Proteina: Scaling Flow-based Protein Structure Generative Models

Authors: Tomas Geffner, Kieran Didi, Zuobai Zhang, Danny Reidenbach, Zhonglin Cao, Jason Yim, Mario Geiger, Christian Dallago, Emine Kucukbenli, Arash Vahdat, Karsten Kreis

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimentally, Proteina achieves state-of-the-art protein backbone generation performance, vastly outperforming all baselines especially in long chain synthesis, and we demonstrate superior control compared to previous models through our novel fold class conditioning. (...) In Tab. 1, we compare our model's performance with baselines for protein backbone generation (see Sec. 2).
Researcher Affiliation | Collaboration | Tomas Geffner1,*, Kieran Didi1,*, Zuobai Zhang1,2,3,*, Danny Reidenbach1, Zhonglin Cao1, Jason Yim1,4, Mario Geiger1, Christian Dallago1, Emine Kucukbenli1, Arash Vahdat1, Karsten Kreis1,*; 1NVIDIA, 2Mila Québec AI Institute, 3Université de Montréal, 4Massachusetts Institute of Technology
Pseudocode | Yes | Algorithm 1: Euler-Maruyama numerical simulation scheme. Input: number of steps N; discretization of the unit interval 0 = t0 < t1 < t2 < ... < tN = 1; stochasticity schedule g(t); noise scaling parameter γ; conditioning variables c
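Given those inputs, each Euler-Maruyama step advances the state by the drift times the step size plus γ·g(t)·√dt-scaled Gaussian noise. The scalar sketch below is generic: `drift` stands in for the learned (conditional) vector field, and the exact update rule used in the paper is not reproduced here.

```python
import math
import random

def euler_maruyama(x0, drift, g, ts, gamma, cond=None):
    """Generic Euler-Maruyama integration over a discretization
    ts = [t0, t1, ..., tN] of the unit interval.

    drift(x, t, cond) and g(t) are placeholders for the learned vector
    field and the stochasticity schedule; gamma scales injected noise.
    """
    x = x0
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        dt = t_next - t_cur
        noise = random.gauss(0.0, 1.0)
        # Deterministic drift step plus stochastic diffusion step.
        x = x + drift(x, t_cur, cond) * dt + gamma * g(t_cur) * math.sqrt(dt) * noise
    return x
```

With g ≡ 0 the scheme reduces to the deterministic Euler (probability-flow) integrator.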
Open Source Code | Yes | For model and code release, please see Proteina's GitHub repository https://github.com/NVIDIA-Digital-Bio/proteina/ as well as our project page https://research.nvidia.com/labs/genair/proteina/.
Open Datasets | Yes | Most protein structure generators have been trained on natural proteins, using filtered subsets of the PDB (Berman et al., 2000), resulting in training set sizes on the order of 20k. Recently, some works (Lin et al., 2024; Huguet et al., 2024; Qu et al., 2024) relied on the AFDB (Varadi et al., 2021) and in- (...) We use The Encyclopedia of Domains (TED) data, which consists of structural domain assignments to proteins in the AFDB (Lau et al., 2024b;a). TED uses the CATH structural hierarchy (Dawson et al., 2016)
Dataset Splits | Yes | The dataset is randomly divided into training, validation, and test sets at a ratio of 8:1:1, ensuring that at least one protein from each class is included in the test set whenever possible. (...) We further clustered the data with MMseqs2 (Steinegger & Söding, 2017) using a 50% sequence similarity threshold. During training, we sample clusters uniformly, and draw random structures within.
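The described split procedure can be sketched roughly as follows; the `get_class` accessor and the move-to-test repair step are illustrative assumptions, not the authors' code, and (as in the quote) class coverage is only guaranteed "whenever possible", here meaning whenever a missing class still occurs in the training portion.

```python
import random

def split_8_1_1(items, get_class, seed=0):
    """Randomly split items 8:1:1 into train/val/test, then move one
    member of each class absent from the test set out of train so the
    test set covers every class whenever possible."""
    rng = random.Random(seed)
    shuffled = items[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train, n_val = int(0.8 * n), int(0.1 * n)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    # Repair step: ensure each class seen in train also appears in test.
    test_classes = {get_class(x) for x in test}
    for x in list(train):
        c = get_class(x)
        if c not in test_classes:
            train.remove(x)
            test.append(x)
            test_classes.add(c)
    return train, val, test
```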
Hardware Specification | Yes | This is run on an A6000-48GB GPU for comparison with previous works (Lin et al., 2024). See Tab. 8 and Fig. 15. 2. For all tested models, we determine the largest supported batch size that fits into GPU memory and does not result in out-of-memory errors. This is executed on an A100-80GB GPU. See Tab. 9.
Software Dependencies | No | In PyTorch code (Paszke et al., 2019), we get [t0, t1, ..., tN] by the following three steps (...) We use Biotite's (Kunzmann & Hamacher, 2018) implementation of the P-SEA algorithm (Labesse et al., 1997) to analyze the secondary structure content of designable backbones. (...) ProteinMPNN (Dauparas et al., 2022) with a sampling temperature of 0.1. We then predict a structure for each sequence using ESMFold (Lin et al., 2023) and calculate the root mean square deviation (RMSD) between each predicted structure and the model's original structure. (...) We perform clustering using Foldseek (van Kempen et al., 2024).
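The RMSD between a designed backbone and its ESMFold-predicted refold is computed after optimal superposition. A generic sketch of that metric via the Kabsch algorithm is shown below; this is a standard construction, not the paper's own evaluation code.

```python
import numpy as np

def kabsch_rmsd(p, q):
    """RMSD between two (N, 3) coordinate arrays after optimal rigid
    superposition (Kabsch algorithm): center both point sets, find the
    best rotation via SVD, then measure the residual deviation."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p_c = p - p.mean(axis=0)
    q_c = q - q.mean(axis=0)
    h = p_c.T @ q_c                      # 3x3 covariance matrix
    u, _, vt = np.linalg.svd(h)
    d = np.sign(np.linalg.det(vt.T @ u.T))  # avoid improper rotations
    rot = vt.T @ np.diag([1.0, 1.0, d]) @ u.T
    p_rot = p_c @ rot.T                  # apply optimal rotation
    return float(np.sqrt(((p_rot - q_c) ** 2).sum() / len(p)))
```

A structure compared against a rigidly rotated copy of itself yields an RMSD of (numerically) zero, which is the sanity check usually run on such implementations.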
Experiment Setup | Yes | Table 16: Hyperparameters for Proteina model training. Pre-training models: M_FS, M_FS^no-tri, M_21M; fine-tuning models: M_LoRA, M_long.

Hyperparameter | M_FS | M_FS^no-tri | M_21M | M_LoRA | M_long
Architecture initialization | random | random | random | M_FS | M_FS^no-tri
sequence repr dim | 768 | 768 | 1024 | 768 | 768
# registers | 10 | 10 | 10 | 10 | 10
sequence cond dim | 512 | 512 | 512 | 512 | 512
t sinusoidal enc dim | 256 | 256 | 256 | 256 | 256
idx. sinusoidal enc dim | 128 | 128 | 128 | 128 | 128
fold emb dim | 256 | 256 | 256 | 256 | 256
pair repr dim | 512 | 512 | 512 | 512 | 512
seq separation dim | 128 | 128 | 128 | 128 | 128
pair distances dim (x_t) | 64 | 64 | 64 | 64 | 64
pair distances dim (x̂(x_t)) | 128 | 128 | 128 | 128 | 128
pair distances min (Å) | 1 | 1 | 1 | 1 | 1
pair distances max (Å) | 30 | 30 | 30 | 30 | 30
# attention heads | 12 | 12 | 16 | 12 | 12
# transformer layers | 15 | 15 | 18 | 15 | 15
# triangle layers | 5 | – | 4 | 5 | –
# trainable parameters | 200M | 200M | 400M | 7M | 200M
# training steps | 200K | 360K | 180K | 11K | 220K/80K
batch size per GPU | 4 | 10 | 4 | 6 | 2/1
# GPUs | 128 | 96 | 128 | 32 | 128
# grad. acc. steps | 1 | 1 | 1 | 2 | 1/2
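The training rows imply an effective (global) batch size per optimizer update of batch size per GPU × #GPUs × gradient-accumulation steps, e.g. 4 × 128 × 1 = 512 structures for the 200M pre-trained model and 6 × 32 × 2 = 384 for the LoRA fine-tune. A trivial helper to check this arithmetic:

```python
def effective_batch(batch_per_gpu, n_gpus, grad_acc_steps):
    """Global number of samples consumed per optimizer update under
    data-parallel training with gradient accumulation."""
    return batch_per_gpu * n_gpus * grad_acc_steps
```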