A Variational Perspective on Generative Protein Fitness Optimization
Authors: Lea Bogensperger, Dominik Narnhofer, Ahmed Allam, Konrad Schindler, Michael Krauthammer
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We validate VLGPO on two public benchmarks for protein fitness optimization in limited data regimes, namely Adeno-Associated Virus (AAV) (Bryant et al., 2021) and Green Fluorescent Protein (GFP) (Sarkisyan et al., 2016), as suggested by (Kirjner et al., 2023). ... We perform fitness optimization in a continuous latent representation... We demonstrate state-of-the-art performance on established benchmarks for protein fitness optimization, namely AAV and GFP... We conduct an ablation study on the influence of manifold constrained gradients in sampling (Line 7, Algorithm 1). |
| Researcher Affiliation | Academia | 1University of Zurich 2ETH Zurich. Correspondence to: Lea Bogensperger <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 VLGPO sampling |
| Open Source Code | Yes | Source code available at https://github.com/uzh-dqbm-cmi/VLGPO. |
| Open Datasets | Yes | We validate VLGPO on two public benchmarks for protein fitness optimization in limited data regimes, namely Adeno-Associated Virus (AAV) (Bryant et al., 2021) and Green Fluorescent Protein (GFP) (Sarkisyan et al., 2016), as suggested by (Kirjner et al., 2023). |
| Dataset Splits | No | The paper defines tasks based on fitness percentile ranges and mutation gaps (Table 1) and lists the number of data samples N for each task (Table 2), which are used for training VAE and flow matching models in a limited data setting. However, it does not explicitly provide specific train/validation/test splits (e.g., percentages or exact counts) for the models being developed in the paper. The oracle gψ is trained on the complete DMS data, but this is for evaluation, not for the VLGPO model's training and evaluation splits. |
| Hardware Specification | No | The paper does not explicitly describe the specific hardware (e.g., GPU models, CPU types) used for running its experiments. |
| Software Dependencies | No | The paper mentions using a '1D CNN commonly used for denoising diffusion probabilistic models (DDPMs)' and links to a GitHub repository, but it does not specify software versions for programming languages, libraries, or frameworks (e.g., Python, PyTorch, CUDA versions). |
| Experiment Setup | Yes | A learning rate of 0.001 with a convolutional architecture and β ∈ {0.01, 0.001} for AAV and GFP is used for training the encoder E and decoder D in Equation (4)... A learning rate of 5e-5 and a batch size of 1024 were used to train v_{θ,t} for 1000 epochs. ... K = 32 ODE steps... The parameters α_t ∈ {0.97, 1.2, 0.56} and J ∈ {39, 19, 37} for AAV (medium), AAV (hard) and GFP (medium). |
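The reported sampling setup (K = 32 ODE steps integrating the learned velocity field v_{θ,t} in latent space) can be sketched as a plain Euler integration. This is a hedged illustration only: the function name `euler_sample`, the toy velocity field, and the latent dimensions are hypothetical, and the paper's actual Algorithm 1 additionally applies fitness guidance and manifold-constrained gradients, which are not reproduced here.

```python
import torch

def euler_sample(v_theta, z0, K=32):
    """Integrate dz/dt = v_theta(z, t) from t=0 to t=1 with K Euler steps.

    v_theta: stand-in for the learned velocity field v_{theta,t};
    z0: latent samples from the prior, shape (batch, dim).
    """
    z, dt = z0, 1.0 / K
    for k in range(K):
        t = torch.full((z.shape[0],), k * dt)  # current time for the batch
        z = z + dt * v_theta(z, t)             # one explicit Euler step
    return z

# Toy velocity field standing in for the trained model (hypothetical).
v = lambda z, t: -z
z_final = euler_sample(v, torch.randn(4, 16), K=32)
```

With K = 32 as in the paper, each call costs 32 network evaluations per batch; the guidance terms in Algorithm 1 would be added inside the loop.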