Fast Uncovering of Protein Sequence Diversity from Structure

Authors: Luca Alessandro Silva, Barthelemy Meynard-Piganeau, Carlo Lucibello, Christoph Feinauer

ICLR 2025

Reproducibility Checklist (Variable | Result | LLM Response)
Research Type: Experimental. "We show that this increased diversity in sampled sequences translates into greater variability in biochemical properties, highlighting the exciting potential of our method for applications such as protein design. The orders-of-magnitude improvement in sampling speed compared to existing methods unlocks new possibilities for high-throughput virtual screening. We extensively validate InvMSAFold through numerous out-of-sample tests, demonstrating its ability to generate sequences that deviate significantly from the native sequence while maintaining structural fidelity and capturing the evolutionary patterns of the MSAs. Additionally, we show that the broader exploration of the sequence domain enables the sampling of a wider range of protein properties of interest."
Researcher Affiliation: Academia. Luca Alessandro Silva, Department of Computing Sciences, Bocconi University, Milan, MI 20100, Italy (EMAIL); Barthelemy Meynard-Piganeau, Institut de Biologie Paris Seine, Biologie Computationnelle et Quantitative (LCQB), Sorbonne Université, Paris, France (EMAIL); Carlo Lucibello, Department of Computing Sciences, BIDSA, Bocconi University, Milan, MI 20100, Italy (EMAIL); Christoph Feinauer, Department of Computing Sciences, Bocconi University, Milan, MI 20100, Italy (EMAIL)
Pseudocode: No. The paper describes its methods using mathematical equations and prose but does not include any clearly labeled pseudocode or algorithm blocks. For example, Section 2.1 ('THE INVMSAFOLD ARCHITECTURE') describes the architecture and its components, and Section 2.2 ('INVMSAFOLD-PW') describes the probabilistic model, but both are presented in text and equations without a structured pseudocode format.
Open Source Code: Yes. Code to train the models and replicate some of the results can be found at the Potts Inverse Folding repository.
Open Datasets: Yes. "We base these on the CATH database (Sillitoe et al., 2021), which classifies protein domains into superfamilies and then further into clusters based on sequence homology. We use the non-redundant dataset of domains at 40% similarity and associate to every domain a cluster as indicated in the CATH database. Finally, we create MSAs for all sequences in the datasets using the MMseqs2 software and the Uniprot50 database."
Dataset Splits: Yes. In order to control the level of homology in the evaluation, the authors create three test sets, called the inter-cluster, intra-cluster, and MSA test sets. These are based on the CATH database (Sillitoe et al., 2021), which classifies protein domains into superfamilies and then further into clusters based on sequence homology. They choose 10% of the sequence clusters uniformly at random and assign them to the inter-cluster test set, excluding these clusters from the training set. They then create the less stringent intra-cluster test set by taking a single random domain from every sequence cluster that is not in the inter-cluster test set and has at least two domains, and use the remaining domains as the training set. Finally, they create MSAs for all sequences in the datasets using the MMseqs2 software and the Uniprot50 database, splitting the sequences in each MSA into 90% used for training and 10% for the MSA test set. The resulting dataset sizes are 22468 for the training set, 22428 for the MSA test set, 1374 for the intra-cluster test set, and 2673 for the inter-cluster test set.
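The cluster-level split described above can be sketched in Python. This is a hypothetical illustration, not the authors' code: the function name `split_by_cluster`, the `domain_to_cluster` input format, and the seeding are assumptions, but the logic follows the quoted procedure (whole clusters held out for the inter-cluster set; one random domain per remaining multi-domain cluster for the intra-cluster set).

```python
import random

def split_by_cluster(domain_to_cluster, inter_frac=0.10, seed=0):
    """Split domains into train / intra-cluster / inter-cluster sets.

    domain_to_cluster: dict mapping domain id -> sequence-cluster id.
    A fraction inter_frac of clusters is held out whole (inter-cluster
    test set); from each remaining cluster with >= 2 domains, one
    random domain is held out for the intra-cluster test set.
    """
    rng = random.Random(seed)
    clusters = {}
    for dom, cl in domain_to_cluster.items():
        clusters.setdefault(cl, []).append(dom)

    cluster_ids = sorted(clusters)
    rng.shuffle(cluster_ids)
    n_inter = max(1, int(inter_frac * len(cluster_ids)))
    inter_clusters = set(cluster_ids[:n_inter])

    train, intra, inter = [], [], []
    for cl, doms in clusters.items():
        if cl in inter_clusters:
            inter.extend(doms)        # whole cluster excluded from training
        elif len(doms) >= 2:
            held = rng.choice(doms)   # one random domain per cluster
            intra.append(held)
            train.extend(d for d in doms if d != held)
        else:
            train.extend(doms)        # singleton clusters stay in training
    return train, intra, inter
```

By construction, no cluster appearing in the inter-cluster set contributes any domain to training, which is what makes it the more stringent homology test.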
Hardware Specification: Yes. To sample from ESM-IF1 the authors used an NVIDIA GeForce RTX 4060 Laptop GPU with 8 GB of memory, while for InvMSAFold-AR they used a single core of an i9-13905H processor.
Software Dependencies: No. The paper mentions several software tools and libraries used (e.g., the ESM-IF1 encoder, AlphaFold 2, Thermoprot, Protein-Sol, MMseqs2, PyHMMER, the AdamW optimizer, Optuna). However, it does not specify explicit version numbers for these software dependencies, which is required for a reproducible description.
Experiment Setup: Yes. For InvMSAFold-PW, the authors train with a single structure in each batch, an MSA subsample size for MX of 64, a rank K of 48, a learning rate of 10^-4, and L2 regularization constants of λh = λJ = 10^-4 for fields and couplings. For InvMSAFold-AR, the hyperparameters are tuned as discussed in Appendix A.2.2. Both models are trained with the AdamW optimizer for a total of 94 epochs.

Hypertuning results:

Model   Dropout   B   M    K    (λJ, λh)            lr
ArDCA   0.1       8   32   48   (3.2e-6, 5.0e-5)    3.4e-4
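The rank K above parameterizes low-rank pairwise couplings in the Potts-style model. The sketch below shows what a rank-K coupling parameterization of a Potts energy can look like; it is an illustrative assumption, not the paper's implementation (the factor names `U`, `V` and the unsymmetrized factorization J_ij(a, b) = U[i, a] · V[j, b] are choices made here for clarity).

```python
import numpy as np

def potts_energy(seq, h, U, V):
    """Energy of a sequence under a Potts model with rank-K couplings.

    seq:  length-L array of residue indices in [0, q).
    h:    (L, q) field parameters.
    U, V: (L, q, K) factors defining couplings J_ij(a, b) = U[i, a] . V[j, b].
          Storing the factors costs O(L q K) instead of O(L^2 q^2).
    """
    L = len(seq)
    e = -sum(h[i, seq[i]] for i in range(L))
    for i in range(L):
        for j in range(i + 1, L):
            e -= U[i, seq[i]] @ V[j, seq[j]]  # rank-K inner product
    return e
```

The point of the low-rank form is memory and speed: the full coupling tensor for a domain of length L over q = 20 amino acids has L^2 q^2 entries, while the factors need only 2 L q K, which is what makes per-structure Potts decoding cheap enough for high-throughput sampling.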