Learning the Language of Protein Structure
Authors: Jérémie Donà, Benoit Gaujac, Timothy Atkinson, Liviu Copoiu, Thomas Pierrot, Thomas D. Barrett
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our contributions are threefold. First, we introduce a series of quantized autoencoders that effectively discretize protein structures into sequences of tokens while preserving the necessary information for accurate reconstruction. Second, we validate our autoencoders through qualitative and quantitative analysis, and various ablation studies, supporting our design choices. Third, we demonstrate the efficacy and practicality of the learned representations with experimental results from a simple GPT model trained on our learned codebook, which successfully generates novel, diverse, and structurally viable protein structures. |
| Researcher Affiliation | Industry | Benoit Gaujac*¹, Jérémie Donà*¹, Liviu Copoiu¹, Timothy Atkinson¹, Thomas Pierrot¹ and Thomas D. Barrett¹ (*Equal contributions: EMAIL; ¹InstaDeep) |
| Pseudocode | Yes | Algorithm 1 Finite Scalar Quantization Algorithm 2 Resampling Layer with Positional Encoding Algorithm 3 Pairwise Module Algorithm 4 Overall Algorithm Pseudo-Code |
| Open Source Code | Yes | We release all experimental code at https://github.com/instadeepai/protein-structure-tokenizer/ and the trained model weights at https://huggingface.co/InstaDeepAI/protein-structure-tokenizer/. |
| Open Datasets | Yes | We use approximately 310,000 entries available in the Protein Data Bank (PDB) (Berman et al., 2000) as training data. For the reference dataset, we use the s40 CATH dataset (Orengo et al., 1997), publicly available at ftp://orengoftp.biochem.ucl.ac.uk/cath/releases/latest-release/non-redundant-data-sets/cath-dataset-nonredundant-S40.pdb.tgz. |
| Dataset Splits | Yes | We randomly select 90% of the clusters for training and use the remaining as test set. Amongst these 10% withheld protein-structure clusters, we retain 20% for validation, the remaining 80% being used for test. |
| Hardware Specification | Yes | We train the model for 100 epochs on 8 TPU v4-8 with a batch size of 128. |
| Software Dependencies | No | The paper mentions using AdamW (Loshchilov and Hutter, 2019) for optimization and refers to implementations of MPNN (Dauparas et al., 2022) and AlphaFold-2's structure module (Jumper et al., 2021). However, it does not provide specific version numbers for software libraries, programming languages, or other key software components used to implement the methodology described in the paper. |
| Experiment Setup | Yes | The optimization is carried out using AdamW (Loshchilov and Hutter, 2019) with β1 = 0.9, β2 = 0.95 and a weight decay of 0.1. We use a learning-rate warm-up scheduler, progressively increasing the learning rate from 10⁻⁶ to 10⁻³ over the first 1000 steps, and train the model for 100 epochs on 8 TPU v4-8 with a batch size of 128. With such hyperparameters, the autoencoder model has 4.5M parameters, and the training lasts 32 hours on a TPU v4-8, which amounts to a total number of FLOPs between 7×10¹⁹ and 10²⁰. For the encoder, we use a 3-layer message-passing neural network following the architecture and implementation proposed in Dauparas et al. (2022) and use the swish activation function. The graph sparsity is set to 50 neighbors per residue. When the downsampling ratio is r > 1, the resampling operation consists of a stack of 3 resampling layers as described in Algorithm 2, the initial queries being defined as positional encodings. We strictly follow the implementation of AlphaFold-2 (Jumper et al., 2021) regarding the structure module and use 6 structure layers. GPT hyperparameters: We use a standard decoder-only transformer following the implementation of Hoffmann et al. (2022) with pre-layer normalization and a dropout rate of 10% during training. We follow Hoffmann et al. (2022) for the choice of parameters, with 20 layers, 16 heads per layer, a model dimension of 1024 and a query size of 64, resulting in a model with 344M parameters. |
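The cluster-level split quoted in the table (90% of clusters for training; of the withheld 10%, 20% for validation and 80% for test) can be sketched as below. This is an illustrative reconstruction, not the authors' code; the function name, seed, and use of integer cluster IDs are assumptions.

```python
import random

def split_clusters(cluster_ids, seed=0):
    """Split cluster IDs 90/10, then split the held-out 10% into 20% val / 80% test."""
    rng = random.Random(seed)  # illustrative fixed seed for reproducibility
    ids = list(cluster_ids)
    rng.shuffle(ids)
    n_train = int(0.9 * len(ids))
    train, held_out = ids[:n_train], ids[n_train:]
    n_val = int(0.2 * len(held_out))
    val, test = held_out[:n_val], held_out[n_val:]
    return train, val, test

train, val, test = split_clusters(range(1000))
# 900 training clusters, 20 validation clusters, 80 test clusters
```

Splitting at the cluster level (rather than per structure) keeps near-duplicate structures from leaking across the train/test boundary.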
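The learning-rate warm-up quoted in the Experiment Setup row (linear increase from 10⁻⁶ to 10⁻³ over the first 1000 steps) can be expressed as a minimal schedule function. The function name is hypothetical, and holding the peak rate after warm-up is an assumption, since the excerpt does not describe the post-warm-up schedule.

```python
def warmup_lr(step, base_lr=1e-6, peak_lr=1e-3, warmup_steps=1000):
    """Linearly increase the learning rate from base_lr to peak_lr over warmup_steps."""
    if step >= warmup_steps:
        return peak_lr  # assumption: hold the peak rate after warm-up
    return base_lr + (peak_lr - base_lr) * (step / warmup_steps)

# At step 0 the rate is 1e-6; by step 1000 it has reached 1e-3.
```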