Learning the Language of Protein Structure
Authors: Jérémie Donà, Benoit Gaujac, Timothy Atkinson, Liviu Copoiu, Thomas Pierrot, Thomas D. Barrett
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our contributions are threefold. First, we introduce a series of quantized autoencoders that effectively discretize protein structures into sequences of tokens while preserving the necessary information for accurate reconstruction. Second, we validate our autoencoders through qualitative and quantitative analysis, and various ablation studies, supporting our design choices. Third, we demonstrate the efficacy and practicality of the learned representations with experimental results from a simple GPT model trained on our learned codebook, which successfully generates novel, diverse, and structurally viable protein structures. |
| Researcher Affiliation | Industry | Benoit Gaujac*¹, Jérémie Donà*¹, Liviu Copoiu¹, Timothy Atkinson¹, Thomas Pierrot¹ and Thomas D. Barrett¹ (*Equal contributions: EMAIL; ¹InstaDeep) |
| Pseudocode | Yes | Algorithm 1 Finite Scalar Quantization Algorithm 2 Resampling Layer with Positional Encoding Algorithm 3 Pairwise Module Algorithm 4 Overall Algorithm Pseudo-Code |
| Open Source Code | Yes | We release all experimental code at https://github.com/instadeepai/protein-structure-tokenizer/ and the trained model weights at https://huggingface.co/InstaDeepAI/protein-structure-tokenizer/. |
| Open Datasets | Yes | We use approximately 310,000 entries available in the Protein Data Bank (PDB) (Berman et al., 2000) as training data. For the reference dataset, we use the s40 CATH dataset (Orengo et al., 1997), publicly available at ftp://orengoftp.biochem.ucl.ac.uk/cath/releases/latest-release/non-redundant-data-sets/cath-dataset-nonredundant-S40.pdb.tgz. |
| Dataset Splits | Yes | We randomly select 90% of the clusters for training and use the remaining as test set. Amongst these 10% withheld protein-structure clusters, we retain 20% for validation, the remaining 80% being used for test. |
| Hardware Specification | Yes | We train the model for 100 epochs on 8 TPU v4-8 with a batch size of 128. |
| Software Dependencies | No | The paper mentions using AdamW (Loshchilov and Hutter, 2019) for optimization and refers to implementations of MPNN (Dauparas et al., 2022) and AlphaFold-2's structure module (Jumper et al., 2021). However, it does not provide specific version numbers for software libraries, programming languages, or other key software components used to implement the methodology described in the paper. |
| Experiment Setup | Yes | The optimization is carried out using AdamW (Loshchilov and Hutter, 2019) with β1 = 0.9, β2 = 0.95 and a weight decay of 0.1. We use a learning-rate warm-up scheduler, progressively increasing the learning rate from 10⁻⁶ to 10⁻³ over the first 1000 steps, and train the model for 100 epochs on 8 TPU v4-8 with a batch size of 128. With such hyperparameters, the autoencoder model has 4.5M parameters, and the training lasts 32 hours on a TPU v4-8, which amounts to a total number of FLOPs between 7×10¹⁹ and 10²⁰. For the encoder, we use a 3-layer message-passing neural network following the architecture and implementation proposed in Dauparas et al. (2022) and use the swish activation function. The graph sparsity is set to 50 neighbors per residue. When the downsampling ratio is r > 1, the resampling operation consists of a stack of 3 resampling layers as described in Algorithm 2, the initial queries being defined as positional encodings. We strictly follow the implementation of AlphaFold-2 (Jumper et al., 2021) regarding the structure module and use 6 structure layers. GPT hyperparameters: We use a standard decoder-only transformer following the implementation of Hoffmann et al. (2022) with pre-layer normalization and a dropout rate of 10% during training. We follow Hoffmann et al. (2022) for the choice of parameters, with 20 layers, 16 heads per layer, a model dimension of 1024 and a query size of 64, resulting in a model with 344M parameters. |
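The cluster-level split quoted in the table (90% of clusters for training; of the withheld 10%, 20% for validation and 80% for test) can be sketched as below. This is an illustrative reconstruction, not the authors' code; the function name, seed, and use of integer cluster IDs are assumptions.

```python
import random

def split_clusters(cluster_ids, seed=0):
    """Split cluster IDs 90/10, then split the held-out 10% into 20% val / 80% test."""
    rng = random.Random(seed)  # illustrative fixed seed for reproducibility
    ids = list(cluster_ids)
    rng.shuffle(ids)
    n_train = int(0.9 * len(ids))
    train, held_out = ids[:n_train], ids[n_train:]
    n_val = int(0.2 * len(held_out))
    val, test = held_out[:n_val], held_out[n_val:]
    return train, val, test

train, val, test = split_clusters(range(1000))
# 900 training clusters, 20 validation clusters, 80 test clusters
```

Splitting at the cluster level (rather than per structure) keeps near-duplicate structures from leaking across the train/test boundary.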
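The learning-rate warm-up quoted in the Experiment Setup row (linear increase from 10⁻⁶ to 10⁻³ over the first 1000 steps) can be expressed as a minimal schedule function. The function name is hypothetical, and holding the peak rate after warm-up is an assumption, since the excerpt does not describe the post-warm-up schedule.

```python
def warmup_lr(step, base_lr=1e-6, peak_lr=1e-3, warmup_steps=1000):
    """Linearly increase the learning rate from base_lr to peak_lr over warmup_steps."""
    if step >= warmup_steps:
        return peak_lr  # assumption: hold the peak rate after warm-up
    return base_lr + (peak_lr - base_lr) * (step / warmup_steps)

# At step 0 the rate is 1e-6; by step 1000 it has reached 1e-3.
```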