Protriever: End-to-End Differentiable Protein Homology Search for Fitness Prediction

Authors: Ruben Weitzman, Peter Mørch Groth, Lood Van Niekerk, Aoi Otani, Yarin Gal, Debora Susan Marks, Pascal Notin

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | When applied to protein fitness prediction, Protriever achieves state-of-the-art performance compared to sequence-based models that rely on MSA-based homolog retrieval, while being two orders of magnitude faster through efficient vector search. Protriever is both architecture- and task-agnostic, and can flexibly adapt to different retrieval strategies and protein databases at inference time, offering a scalable alternative to alignment-centric approaches. We demonstrate that Protriever achieves state-of-the-art performance among sequence-based models on the ProteinGym benchmarks, while being orders of magnitude faster at homolog retrieval than standard MSA approaches including JackHMMER, MMseqs2, and MMseqs2-GPU (4 and 5).
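The speedup claimed above comes from replacing profile-based MSA search with dense vector retrieval: each database sequence is embedded once, and homologs are found by maximum inner-product search over the embedding index. A minimal sketch of that retrieval step, using brute-force NumPy in place of Faiss (the embedding dimension, database size, and random embeddings are all illustrative, not Protriever's actual encoder):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy database of pre-computed, L2-normalized "sequence embeddings".
# Protriever uses a learned retriever encoder; here they are random vectors.
d = 64                                          # embedding dimension (illustrative)
db = rng.normal(size=(1000, d)).astype(np.float32)
db /= np.linalg.norm(db, axis=1, keepdims=True)

def retrieve(query: np.ndarray, k: int = 20) -> np.ndarray:
    """Return indices of the k nearest database entries by inner product."""
    q = query / np.linalg.norm(query)
    scores = db @ q                             # cosine similarity on unit vectors
    return np.argsort(-scores)[:k]              # top-k, highest score first

# A slightly perturbed copy of entry 42 should retrieve entry 42 first.
query = db[42] + 0.01 * rng.normal(size=d).astype(np.float32)
top = retrieve(query, k=20)
```

In practice the brute-force matrix product is what Faiss accelerates (and approximates at scale), which is where the two-orders-of-magnitude retrieval speedup over MSA construction comes from.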
Researcher Affiliation | Collaboration | Ruben Weitzman (1,2), Peter Mørch Groth (3,4), Lood Van Niekerk (5), Aoi Otani (2), Yarin Gal (1), Debora S. Marks (2), Pascal Notin (2). Affiliations: (1) Department of Computer Science, University of Oxford; (2) Department of Systems Biology, Harvard Medical School; (3) Department of Computer Science, University of Copenhagen; (4) Enzyme Research, Novonesis; (5) Ginkgo Bioworks. Correspondence to: Ruben Weitzman <EMAIL>, Debora Marks <EMAIL>, Yarin Gal <EMAIL>, Pascal Notin <EMAIL>.
Pseudocode | No | The paper describes the Protriever framework with a diagram (Figure 1) and textual explanations of its components (retriever module, index, reader module) and training procedure. However, it does not include any explicitly labeled pseudocode blocks or algorithms.
Open Source Code | Yes | We make our code available at https://github.com/OATML-Markslab/Protriever.
Open Datasets | Yes | We evaluate Protriever on the substitution benchmark of ProteinGym (Notin et al., 2023), containing 217 deep mutational scanning (DMS) experiments that probe the natural function of protein variants. DMS experiments systematically measure the functional effects of individual amino acid substitutions across a protein sequence, providing comprehensive fitness landscapes for specific proteins. Consequently, to perform well on this benchmark, models must capture a nuanced understanding of the biochemical constraints of the corresponding proteins, as they must detect the subtle effects of minor sequence changes.
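ProteinGym's standard per-assay metric is the Spearman rank correlation between model scores and measured DMS fitness values, since only the relative ordering of variants is comparable across assays. A minimal sketch of that comparison, with made-up scores (the helper ignores ties, which the benchmark's actual tooling handles via average ranks):

```python
import numpy as np

def spearman(a, b):
    """Spearman rank correlation: Pearson correlation of the ranks.
    Simplified: assumes no tied values (no average-rank handling)."""
    ra = np.argsort(np.argsort(a)).astype(float)
    rb = np.argsort(np.argsort(b)).astype(float)
    ra -= ra.mean()
    rb -= rb.mean()
    return float((ra @ rb) / np.sqrt((ra @ ra) * (rb @ rb)))

measured = np.array([0.1, 0.4, 0.35, 0.9, 0.7])       # DMS fitness (illustrative)
predicted = np.array([-3.2, -1.5, -2.0, -0.1, -0.8])  # model scores (illustrative)
rho = spearman(measured, predicted)                    # identical ranking -> 1.0
```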
Dataset Splits | Yes | We evaluate Protriever on the substitution benchmark of ProteinGym (Notin et al., 2023), containing 217 deep mutational scanning (DMS) experiments that probe the natural function of protein variants. Additionally, we score sequences in both directions (N-terminus to C-terminus and vice versa), a strategy shown to improve predictive performance (Notin et al., 2022). For ten validation sets (see Appendix F), we use wild-type sequences as queries and retrieve homologs using the same four methods as above.
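The bidirectional scoring mentioned here averages the log-likelihood over both reading directions of the sequence. A minimal sketch, with a toy bigram "log-likelihood" table standing in for the actual autoregressive reader (the table values and sequence are illustrative):

```python
# Toy bigram log-probabilities; unseen bigrams get a default penalty.
# A stand-in for the reader's autoregressive log-likelihood, not the real model.
BIGRAM_LOGP = {"AC": -0.1, "CD": -0.3, "DC": -1.5, "CA": -2.0}

def score_sequence(seq: str) -> float:
    """Sum of bigram log-probabilities along the sequence."""
    return sum(BIGRAM_LOGP.get(a + b, -1.0) for a, b in zip(seq, seq[1:]))

def bidirectional_score(seq: str) -> float:
    forward = score_sequence(seq)         # N-terminus to C-terminus
    backward = score_sequence(seq[::-1])  # C-terminus to N-terminus
    return 0.5 * (forward + backward)

s = bidirectional_score("ACD")  # 0.5 * ((-0.1 - 0.3) + (-1.5 - 2.0)) = -1.95
```

Averaging the two directions smooths out the positional bias of a single autoregressive factorization, which is the rationale cited from Notin et al. (2022).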
Hardware Specification | Yes | To apply this scoring methodology, we first build an index of all protein sequences in our database. At inference time, we use the trained retriever from Protriever to encode all 62 million UniRef50 sequences. This process is parallelized across GPUs and uses FlashAttention (Dao et al., 2022) to enable large batch sizes, completing in approximately 30 minutes on four A100 GPUs. Protriever and GPU-accelerated MMseqs2 searches are run on a single L40S GPU using one CPU thread.
Software Dependencies | No | The paper mentions using Faiss for GPU-accelerated vector similarity search, AdamW as the optimizer, and the ESM encoder and Tranception decoder architectures. However, specific version numbers for these software components and libraries are not provided in the text.
Experiment Setup | Yes | We train our model (without DPR pretraining) within the Protriever framework, with the EMDR end-to-end loss on the retriever, for 50,000 iterations. We use AdamW with a batch size of 16, a context size of 20, and learning rates of 4×10⁻⁵ for the reader and 5×10⁻⁵ for the retriever, with linear decay and 1,000 warm-up steps. We re-index our dataset every 5,000 steps, for a total of 10 re-indexing stages.
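The learning-rate schedule described above (linear warm-up for 1,000 steps, then linear decay over the 50,000-iteration run) and the 5,000-step re-indexing cadence can be sketched as plain functions. This is an illustrative reconstruction from the stated hyperparameters, not the authors' training code; the helper name `lr_at` is hypothetical:

```python
# Hyperparameters quoted from the setup above.
TOTAL_STEPS = 50_000
WARMUP_STEPS = 1_000
READER_PEAK_LR = 4e-5
RETRIEVER_PEAK_LR = 5e-5
REINDEX_EVERY = 5_000

def lr_at(step: int, peak_lr: float) -> float:
    """Linear warm-up to peak_lr, then linear decay to zero at TOTAL_STEPS."""
    if step < WARMUP_STEPS:
        return peak_lr * step / WARMUP_STEPS
    frac = (TOTAL_STEPS - step) / (TOTAL_STEPS - WARMUP_STEPS)
    return peak_lr * max(frac, 0.0)

# Re-indexing cadence: refresh the retrieval index every 5,000 steps,
# giving 10 re-indexing stages over the run.
reindex_steps = list(range(REINDEX_EVERY, TOTAL_STEPS + 1, REINDEX_EVERY))
```

In practice each parameter group (reader vs. retriever) would get its own peak learning rate, e.g. `lr_at(step, READER_PEAK_LR)` and `lr_at(step, RETRIEVER_PEAK_LR)`.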