Protein Language Model Fitness is a Matter of Preference
Authors: Cade Gordon, Amy Lu, Pieter Abbeel
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our work studies trends observed over hundreds of deep mutational scans across multiple different fitness objectives. We find that the likelihood, or abstractly, the implicit preference for a certain protein sequence imbued during pretraining, is predictive of fitness prediction capabilities. Both over-preferred and under-preferred wild type sequences harm performance. Using influence functions to causally understand how individual data points increase protein likelihoods, we find that there exists a power law tail due to sequence homology. Lastly, under-performance on low likelihood wild type proteins can be remedied by unsupervised finetuning. These findings, that pLM zero-shot fitness estimation can be predicted by the likelihood of the engineered sequence, can motivate and improve the deployment of pLMs in protein maturation campaigns. |
| Researcher Affiliation | Academia | Cade Gordon, Amy X. Lu & Pieter Abbeel, University of California, Berkeley |
| Pseudocode | Yes | Algorithm 1: Traditional Pseudo Log Likelihood Calculation; Algorithm 2: Single-Inference Pseudo Log Likelihood |
| Open Source Code | No | The paper does not explicitly state that the authors are releasing their own code. It mentions using the 'Kronfluence library' by Bae (2024), which is a third-party implementation, but not the authors' specific code for their methodology. |
| Open Datasets | Yes | We take the wild type proteins in 217 DMS studies from ProteinGym (Notin et al., 2023) and calculate PLLs for each of them. To approximate ESM-2's training distribution, we randomly sample 10,000 proteins from UniRef50 and trim sequences to be of length at most 1,024. We utilize the July 24, 2024 UniProt releases of both UniRef50 and UniRef100 (Suzek et al., 2015). |
| Dataset Splits | No | The paper refers to |
| Hardware Specification | Yes | Each run consumes a single 80GB A100 GPU. |
| Software Dependencies | No | The paper mentions 'AdamW' and the 'Kronfluence library' (Bae, 2024) but does not provide specific version numbers for these or any other software components used in their experiments. |
| Experiment Setup | Yes | Post training utilizes AdamW (Loshchilov et al., 2017) with a learning rate of 1e-6 for 5 epochs on the 1,000 most similar proteins to wild type, as determined by the E-value of an MMseqs2 search with a maximum cutoff of 1. Finetuning starts at a batch size of 32, which is progressively halved upon the occurrence of an out-of-memory exception. |
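The traditional pseudo log likelihood (PLL) named in the Pseudocode row can be sketched as below. This is an illustrative reconstruction, not the authors' code: `masked_logprob` is a hypothetical stand-in for a masked language model such as ESM-2, scoring the true residue at a masked position.

```python
import math

def pseudo_log_likelihood(sequence, masked_logprob):
    """Traditional PLL: mask each position in turn and sum the
    log-probability the model assigns to the true residue there.
    One model forward pass is needed per position."""
    total = 0.0
    for i, residue in enumerate(sequence):
        masked = sequence[:i] + "<mask>" + sequence[i + 1:]
        total += masked_logprob(masked, i, residue)
    return total

# Toy stand-in model: uniform over the 20 amino acids, so every
# masked prediction contributes log(1/20).
uniform = lambda masked, i, residue: math.log(1.0 / 20.0)
print(pseudo_log_likelihood("MKTA", uniform))  # 4 * log(1/20)
```

Algorithm 2 in the paper amortizes this loop into a single inference; the sketch above shows only the traditional per-position variant.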
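The Open Datasets row describes subsampling 10,000 proteins from UniRef50 and trimming them to at most 1,024 residues. A minimal sketch of that preprocessing, assuming sequences are already loaded as plain strings (the function name and seed are illustrative, not from the paper):

```python
import random

def sample_and_trim(sequences, n=10_000, max_len=1_024, seed=0):
    """Randomly sample up to n sequences without replacement and
    trim each to at most max_len residues."""
    rng = random.Random(seed)
    chosen = rng.sample(sequences, min(n, len(sequences)))
    return [s[:max_len] for s in chosen]

# Toy usage: 3 sequences requested from a pool of 5, trimmed to 4 residues.
pool = ["MKTAYIAK", "GAVL", "MK", "AAAAAAAAAA", "CDEF"]
subset = sample_and_trim(pool, n=3, max_len=4)
print(subset)
```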
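The Experiment Setup row mentions progressively halving the batch size when an out-of-memory exception occurs. A hedged sketch of that fallback loop, with `train_step` as a hypothetical callable that raises `MemoryError` when a batch does not fit (real GPU frameworks raise their own OOM exception types):

```python
def fit_with_fallback(train_step, batch_size=32, min_batch=1):
    """Attempt training at the given batch size; on an out-of-memory
    error, halve the batch size and retry, as described above."""
    while batch_size >= min_batch:
        try:
            train_step(batch_size)
            return batch_size  # first batch size that fits
        except MemoryError:
            batch_size //= 2
    raise RuntimeError("even the minimum batch size ran out of memory")

# Simulated hardware: only batch sizes of 8 or fewer fit in memory.
step = lambda bs: None if bs <= 8 else (_ for _ in ()).throw(MemoryError())
print(fit_with_fallback(step))  # 8
```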