Internal-Coordinate Density Modelling of Protein Structure: Covariance Matters
Authors: Marloes Arts, Jes Frellsen, Wouter Boomsma
TMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we present a new strategy for modelling protein densities in internal coordinates, which uses constraints in 3D space to induce covariance structure between the internal degrees of freedom. We illustrate the potential of the procedure by constructing a variational autoencoder with full covariance output induced by the constraints implied by the conditional mean in 3D, and demonstrate that our approach makes it possible to scale density models of internal coordinates to full protein backbones in two settings: 1) a unimodal, low-data setting for proteins exhibiting small fluctuations, and 2) a multimodal, high-data setting for proteins exhibiting larger conformational changes. Sections 4 and 4.3 detail 'Experiments' and 'Internal-coordinate density modelling results', including metrics, test cases, and performance comparisons with figures (e.g., Ramachandran plots, variance along the atom chain, TICA free energy landscapes) and tables in the appendix. |
| Researcher Affiliation | Academia | Marloes Arts (EMAIL), Department of Computer Science, University of Copenhagen; Jes Frellsen (EMAIL), Department of Applied Mathematics and Computer Science, Technical University of Denmark; Wouter Boomsma (EMAIL), Department of Computer Science, University of Copenhagen. All authors are affiliated with universities (University of Copenhagen, Technical University of Denmark) and use academic email domains (.ku.dk, .dtu.dk). |
| Pseudocode | No | The paper describes the VAE model architecture and training process in detail, including equations for the loss function and a model overview diagram in Figure 3. However, it does not contain a dedicated section or figure explicitly labeled 'Pseudocode' or 'Algorithm', nor are the procedural steps formatted in a code-like block. |
| Open Source Code | Yes | The code base, NMR datasets and in-house generated MD data are available at this github repository: https://github.com/mearts/VAE_covariance_matters. |
| Open Datasets | Yes | The code base, NMR datasets and in-house generated MD data are available at this github repository: https://github.com/mearts/VAE_covariance_matters. Specifically, 1unc corresponds to the solution structure of the human villin C-terminal headpiece subdomain. This protein contains 36 residues with 108 backbone (N, Cα and C) atoms. This solution nuclear magnetic resonance (NMR) dataset is freely available from the Protein Data Bank and contains 25 conformers. 1fsd, a beta beta alpha (BBA) motif, is also a freely available NMR dataset containing 41 structures. |
| Dataset Splits | Yes | All datasets were split 90%-10% into a training and validation set, with the same split for our VAE and all baselines. |
| Hardware Specification | Yes | All models were trained using an Adam optimizer with a learning rate of 5e-4, on a Nvidia Quadro RTX (48GB) GPU. |
| Software Dependencies | Yes | All structure visualizations were done using PyMOL (Schrödinger, version 2.5.2). |
| Experiment Setup | Yes | The encoder and the decoder of the VAE are simple three-layer MLPs (multilayer perceptrons)... The MLP linear layer sizes of the encoder are [128, 64, 32], mapping to a 16-dimensional latent space, and the layer sizes of the decoder are [32, 64, 128]... The weights for the κ-prior and auxiliary loss were explored with grid search (see Appendix D); the values chosen for the models reported in the main paper are shown in Table A1 together with other experimental details. Model training starts with a warm-up phase in two different ways: 1) predicting µκ only, with Σ = I, and 2) linearly increasing the weight of the KL-term from 0 to 1. Proteins in the low data regime (unimodal setting) have a 100-epoch mean-only warm-up and a 200-epoch KL warm-up, while proteins in the high data regime (multimodal setting) have a 3-epoch mean-only warm-up and an 8-epoch KL warm-up. All models were trained using an Adam optimizer with a learning rate of 5e-4... Table A1: Experimental details for test cases (includes # epochs, batch size, a, w_aux). |
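The two-phase warm-up described in the Experiment Setup row can be sketched as a small scheduling function. This is a minimal illustration, not the authors' code: the function name `warmup_schedule` is ours, and we assume both phases are counted from epoch 0 (the paper does not state whether the KL ramp starts during or after the mean-only phase).

```python
def warmup_schedule(epoch, mean_only_epochs, kl_warmup_epochs):
    """Two-phase warm-up (hypothetical sketch):
    1) for the first `mean_only_epochs`, predict the mean mu_kappa only,
       with the covariance fixed to Sigma = I;
    2) linearly ramp the KL-term weight from 0 to 1 over `kl_warmup_epochs`.
    Returns (mean_only_flag, kl_weight) for the given epoch."""
    mean_only = epoch < mean_only_epochs
    kl_weight = min(1.0, epoch / kl_warmup_epochs)  # linear ramp, capped at 1
    return mean_only, kl_weight

# Low data regime (unimodal): 100-epoch mean-only, 200-epoch KL warm-up
print(warmup_schedule(0, 100, 200))    # (True, 0.0)
print(warmup_schedule(150, 100, 200))  # (False, 0.75)
# High data regime (multimodal): 3-epoch mean-only, 8-epoch KL warm-up
print(warmup_schedule(4, 3, 8))        # (False, 0.5)
```

During the mean-only phase, the training loss would use only the reconstruction of µκ with Σ = I; once `kl_weight` reaches 1, the full ELBO with the learned covariance is optimized, as described in the paper.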