Protein Language Model Fitness is a Matter of Preference

Authors: Cade Gordon, Amy Lu, Pieter Abbeel

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our work studies trends observed over hundreds of deep mutational scans spanning multiple fitness objectives. We find that the likelihood, or abstractly, the implicit preference for a given protein sequence imbued during pretraining, is predictive of fitness prediction capability. Both over-preferred and under-preferred wild type sequences harm performance. Using influence functions to causally understand how individual data points increase protein likelihoods, we find a power-law tail due to sequence homology. Lastly, under-performance on low-likelihood wild type proteins can be remedied by unsupervised finetuning. The finding that pLM zero-shot fitness estimation can be predicted by the likelihood of the engineered sequence can motivate and improve pLM deployment in protein maturation campaigns.
Researcher Affiliation | Academia | Cade Gordon, Amy X. Lu & Pieter Abbeel, University of California, Berkeley
Pseudocode | Yes | Algorithm 1: Traditional Pseudo Log-Likelihood Calculation; Algorithm 2: Single-Inference Pseudo Log-Likelihood
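The traditional pseudo log-likelihood named above can be sketched as follows. This is a minimal illustration of the masking-and-scoring loop, not the paper's implementation: `toy_logprobs` is a hypothetical stand-in for a masked language model forward pass (in the paper this would be ESM-2), included only so the sketch is self-contained.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def toy_logprobs(seq, pos):
    # Hypothetical stand-in for a masked-LM forward pass: scores each
    # amino acid by its (Laplace-smoothed) frequency in the unmasked context.
    counts = {aa: 1 for aa in AMINO_ACIDS}
    for j, aa in enumerate(seq):
        if j != pos:
            counts[aa] += 1
    total = sum(counts.values())
    return {aa: math.log(c / total) for aa, c in counts.items()}

def pseudo_log_likelihood(seq, logprob_fn=toy_logprobs):
    # Traditional PLL: mask one position at a time (one forward pass per
    # residue) and sum the log-probability of the true residue there.
    return sum(logprob_fn(seq, i)[aa] for i, aa in enumerate(seq))
```

Under this toy model a homogeneous sequence scores higher than a diverse one, mirroring how PLL reflects how "preferred" a sequence is; with a real pLM, `logprob_fn` would run the model on the sequence with position `i` replaced by a mask token.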
Open Source Code No The paper does not explicitly state that the authors are releasing their own code. It mentions using the 'Kronfluence library' by Bae (2024), which is a third-party implementation, but not the authors' specific code for their methodology.
Open Datasets | Yes | We take the wild type proteins in 217 DMS studies from ProteinGym (Notin et al., 2023) and calculate PLLs for each of them. To approximate ESM-2's training distribution, we randomly sample 10,000 proteins from UniRef50 and trim sequences to a length of at most 1,024. We utilize the July 24, 2024 UniProt releases of both UniRef50 and UniRef100 (Suzek et al., 2015).
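The sampling-and-trimming step described in this row can be sketched as below. This is a hedged illustration, not the authors' code: the function name and the seed handling are assumptions, and the paper does not specify its random-number machinery.

```python
import random

MAX_LEN = 1024  # maximum sequence length used in the paper's setup

def sample_and_trim(sequences, n=10_000, max_len=MAX_LEN, seed=0):
    # Randomly sample up to n sequences (without replacement) and trim
    # each to max_len residues, approximating the pLM's training
    # distribution as described in the paper.
    rng = random.Random(seed)
    picked = rng.sample(sequences, min(n, len(sequences)))
    return [s[:max_len] for s in picked]
```

In practice `sequences` would be read from a UniRef50 FASTA file; here it can be any list of amino-acid strings.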
Dataset Splits No The paper refers to
Hardware Specification Yes Each run consumes a single 80GB A100 GPU.
Software Dependencies | No | The paper mentions 'AdamW' and the 'Kronfluence library' (Bae, 2024) but does not provide specific version numbers for these or any other software components used in their experiments.
Experiment Setup | Yes | Post-training utilizes AdamW (Loshchilov et al., 2017) with a learning rate of 1e-6 for 5 epochs on the 1,000 proteins most similar to wild type, as determined by the E-value of an mmseqs2 search with a maximum cutoff of 1. Finetuning starts at a batch size of 32, which is progressively halved on the occurrence of an out-of-memory exception.
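The batch-size backoff described in this row can be sketched as a retry loop. This is a minimal sketch under stated assumptions, not the authors' training code: `train_step` is a hypothetical callable standing in for one training run, and `MemoryError` stands in for a GPU out-of-memory exception (e.g. `torch.cuda.OutOfMemoryError` in a real setup).

```python
def fit_with_backoff(train_step, batch_size=32, min_batch_size=1):
    # Try training at the current batch size; on an out-of-memory
    # exception, halve the batch size and retry, as the paper describes
    # (starting from 32).
    while batch_size >= min_batch_size:
        try:
            train_step(batch_size)       # hypothetical training call
            return batch_size            # the batch size that fit in memory
        except MemoryError:              # stand-in for a CUDA OOM error
            batch_size //= 2
    raise RuntimeError("batch size fell below minimum without fitting")
```

Returning the successful batch size makes the backoff observable; a real loop would instead continue training for the remaining epochs at that size.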