Protein Language Model Fitness is a Matter of Preference

Authors: Cade Gordon, Amy Lu, Pieter Abbeel

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our work studies trends observed over hundreds of deep mutational scans spanning multiple fitness objectives. We find that the likelihood, or abstractly, the implicit preference for a given protein sequence imbued during pretraining, is predictive of fitness prediction capability. Both over-preferred and under-preferred wild type sequences harm performance. Using influence functions to causally understand how individual data points increase protein likelihoods, we find a power-law tail due to sequence homology. Lastly, under-performance on low-likelihood wild type proteins can be remedied by unsupervised finetuning. The finding that pLM zero-shot fitness estimation can be predicted by the likelihood of the engineered sequence can motivate and improve pLM deployment in protein maturation campaigns.
Researcher Affiliation | Academia | Cade Gordon, Amy X. Lu & Pieter Abbeel, University of California, Berkeley
Pseudocode | Yes | Algorithm 1: Traditional Pseudo Log-Likelihood Calculation; Algorithm 2: Single-Inference Pseudo Log-Likelihood
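The traditional pseudo log-likelihood named above can be sketched as follows. This is a minimal illustration of the masking-and-scoring loop, not the paper's implementation: `toy_logprobs` is a hypothetical stand-in for a masked language model forward pass (in the paper this would be ESM-2), included only so the sketch is self-contained.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def toy_logprobs(seq, pos):
    # Hypothetical stand-in for a masked-LM forward pass: scores each
    # amino acid by its (Laplace-smoothed) frequency in the unmasked context.
    counts = {aa: 1 for aa in AMINO_ACIDS}
    for j, aa in enumerate(seq):
        if j != pos:
            counts[aa] += 1
    total = sum(counts.values())
    return {aa: math.log(c / total) for aa, c in counts.items()}

def pseudo_log_likelihood(seq, logprob_fn=toy_logprobs):
    # Traditional PLL: mask one position at a time (one forward pass per
    # residue) and sum the log-probability of the true residue there.
    return sum(logprob_fn(seq, i)[aa] for i, aa in enumerate(seq))
```

Under this toy model a homogeneous sequence scores higher than a diverse one, mirroring how PLL reflects how "preferred" a sequence is; with a real pLM, `logprob_fn` would run the model on the sequence with position `i` replaced by a mask token.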
Open Source Code No The paper does not explicitly state that the authors are releasing their own code. It mentions using the 'Kronfluence library' by Bae (2024), which is a third-party implementation, but not the authors' specific code for their methodology.
Open Datasets | Yes | We take the wild type proteins in 217 DMS studies from ProteinGym (Notin et al., 2023) and calculate PLLs for each of them. To approximate ESM-2's training distribution, we randomly sample 10,000 proteins from UniRef50 and trim sequences to a length of at most 1,024. We utilize the July 24, 2024 UniProt releases of both UniRef50 and UniRef100 (Suzek et al., 2015).
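The sampling-and-trimming step described in this row can be sketched as below. This is a hedged illustration, not the authors' code: the function name and the seed handling are assumptions, and the paper does not specify its random-number machinery.

```python
import random

MAX_LEN = 1024  # maximum sequence length used in the paper's setup

def sample_and_trim(sequences, n=10_000, max_len=MAX_LEN, seed=0):
    # Randomly sample up to n sequences (without replacement) and trim
    # each to max_len residues, approximating the pLM's training
    # distribution as described in the paper.
    rng = random.Random(seed)
    picked = rng.sample(sequences, min(n, len(sequences)))
    return [s[:max_len] for s in picked]
```

In practice `sequences` would be read from a UniRef50 FASTA file; here it can be any list of amino-acid strings.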
Dataset Splits No The paper refers to
Hardware Specification Yes Each run consumes a single 80GB A100 GPU.
Software Dependencies | No | The paper mentions 'AdamW' and the 'Kronfluence library' (Bae, 2024) but does not provide specific version numbers for these or any other software components used in their experiments.
Experiment Setup | Yes | Post-training utilizes AdamW (Loshchilov et al., 2017) with a learning rate of 1e-6 for 5 epochs on the 1,000 proteins most similar to wild type, as determined by the E-value of an mmseqs2 search with a maximum cutoff of 1. Finetuning starts at a batch size of 32, which is progressively halved on the occurrence of an out-of-memory exception.
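The batch-size backoff described in this row can be sketched as a retry loop. This is a minimal sketch under stated assumptions, not the authors' training code: `train_step` is a hypothetical callable standing in for one training run, and `MemoryError` stands in for a GPU out-of-memory exception (e.g. `torch.cuda.OutOfMemoryError` in a real setup).

```python
def fit_with_backoff(train_step, batch_size=32, min_batch_size=1):
    # Try training at the current batch size; on an out-of-memory
    # exception, halve the batch size and retry, as the paper describes
    # (starting from 32).
    while batch_size >= min_batch_size:
        try:
            train_step(batch_size)       # hypothetical training call
            return batch_size            # the batch size that fit in memory
        except MemoryError:              # stand-in for a CUDA OOM error
            batch_size //= 2
    raise RuntimeError("batch size fell below minimum without fitting")
```

Returning the successful batch size makes the backoff observable; a real loop would instead continue training for the remaining epochs at that size.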