ProteinBench: A Holistic Evaluation of Protein Foundation Models
Authors: Fei YE, Zaixiang Zheng, Dongyu Xue, Yuning Shen, Lihao Wang, Yiming Ma, Yan Wang, Xinyou Wang, Xiangxin Zhou, Quanquan Gu
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our comprehensive evaluation of protein foundation models reveals several key findings that shed light on their current capabilities and limitations. To promote transparency and facilitate further research, we release the evaluation dataset, code, and a leaderboard publicly for further analysis and a general modular toolkit. |
| Researcher Affiliation | Industry | Fei Ye, Zaixiang Zheng, Dongyu Xue, Yuning Shen, Lihao Wang, Yiming Ma, Yan Wang, Xinyou Wang, Xiangxin Zhou and Quanquan Gu. ByteDance Research. |
| Pseudocode | No | The paper describes methodologies in prose and uses conceptual diagrams (e.g., Figure 1, Figure 3) to illustrate processes, but it does not contain any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | To promote transparency and facilitate further research, we release the evaluation dataset, code, and a leaderboard publicly for further analysis and a general modular toolkit. We intend for ProteinBench to be a living benchmark for establishing a standardized, in-depth evaluation framework for protein foundation models, driving their development and application while fostering collaboration within the field. Project page: https://proteinbench.github.io/ |
| Open Datasets | Yes | To promote transparency and facilitate further research, we release the evaluation dataset, code, and a leaderboard publicly for further analysis and a general modular toolkit. [Datasets] Typically, inverse folding methods are trained using CATH or PDB datasets. To prevent data leakage, we utilized newly released PDB structures collected from CASP and CAMEO, as well as de novo designed backbones that have not been included in any training sets. Evaluations were conducted on different datasets targeting two distinct objectives of structure-based sequence design. [Datasets] The Structural Antibody Database (SAbDab; Dunbar et al. (2013)) is the commonly used dataset for antibody design. [Datasets] We use CAMEO2022 from Jing et al. (2023) for evaluation, which consists of 183 short-to-mid-length single protein chains (< 750 amino acids) from the targets of CAMEO (a continuous benchmarking initiative for structure prediction of newly deposited protein structures). [Datasets] We evaluate performance using the ATLAS dataset (Vander Meersche et al., 2024), a recent database of MD simulation results for diverse proteins. |
| Dataset Splits | Yes | For each model and sequence length, we sample 50 sequences to evaluate their performance. Training data: To build the unified training data, we use antibody-antigen complex structural data from the SAbDab dataset under the IMGT scheme (Lefranc et al., 2009) as the training dataset...Finally, we select the clusters that do not contain complexes in the RAbD dataset and split the complexes into training and validation sets with a ratio of 9:1 (1786 and 193 complexes respectively). Test data: To build the unified test data, we extracted 55 antibody-antigen complexes from the RAbD dataset. CAMEO2022... consists of 183 single protein chains collected from CAMEO targets between August and October 2022, with sequence lengths of less than 750 amino acids. A total of 20 conformations are sampled for each protein during evaluation. A total of 1,000 conformations are sampled for evaluation. We sample 250 conformations for each protein for evaluation. |
| Hardware Specification | No | The paper does not explicitly state the specific hardware (e.g., GPU models, CPU types) used for running its experiments. While it mentions using tools like ColabFold for inference, it does not specify the authors' own experimental hardware setup. |
| Software Dependencies | Yes | We used OpenFold v2.0.0 for inference with their pretrained OpenFold weights (with pTM). |
| Experiment Setup | Yes | A sampling temperature of 0.1 was used for each method to generate sequences. While this value balances sequence diversity and quality, optimal temperatures may vary across inverse folding methods. ProteinMPNN (Dauparas et al., 2022a): We follow the official repository and instructions for inference, with the sampling temperature set to 0.1. The default model weight v_48_020.pt is used. The generation process is constrained only by specified target lengths, which we set at 50, 100, 200, 300, 400, and 500 residues. In dyMEAN-FixFR, we also used Rosetta (Alford et al., 2017) to repack the side chains, consistent with other methods, to avoid the influence of the side chains generated by dyMEAN on the evaluation results. Additionally, we introduced some randomness in the initialization of the structure, which allows dyMEAN-FixFR to generate multiple different antibodies for the same antigen. All models were run with default settings. We then calculated the energy on the all-atom structure. Finally, we used the Interface Analyzer in Rosetta to calculate the binding energy between CDR-H3 and the antigen. During minimization, we set the step count to 100 (we tried more steps and repeats; the energy did decrease further, but the reduction was very limited and much smaller than the energy difference between methods, while the time consumption increased significantly). |
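The cluster-level 9:1 train/validation split described under Dataset Splits — dropping any cluster that overlaps the RAbD test set, then splitting the remaining clusters — can be sketched as follows. This is a minimal illustration, not the authors' code; the cluster ids, complex ids, and the `split_clusters` helper are hypothetical.

```python
import random

def split_clusters(clusters, val_frac=0.1, exclude=frozenset(), seed=0):
    """Split complexes into train/validation sets at the cluster level,
    so related complexes never straddle the split.

    `clusters` maps a cluster id to a list of complex ids; any cluster
    containing an id in `exclude` (e.g. the held-out test set) is
    dropped before splitting, to prevent test-set leakage.
    """
    kept = [ids for ids in clusters.values()
            if not exclude.intersection(ids)]
    rng = random.Random(seed)
    rng.shuffle(kept)
    n_val = max(1, round(len(kept) * val_frac))
    val = [c for ids in kept[:n_val] for c in ids]
    train = [c for ids in kept[n_val:] for c in ids]
    return train, val

# Toy example with hypothetical cluster/complex ids: 10 clusters of 2,
# one cluster excluded because it overlaps the test set.
clusters = {f"c{i}": [f"pdb{i}a", f"pdb{i}b"] for i in range(10)}
train, val = split_clusters(clusters, exclude={"pdb9a"})
```

Splitting by cluster rather than by individual complex is what makes the 9:1 ratio meaningful: near-duplicate antibodies stay on one side of the split.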
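The Experiment Setup row notes that all inverse folding methods sample sequences at temperature 0.1, trading diversity against quality. A minimal sketch of temperature sampling over per-position amino-acid logits, assuming a hypothetical `sample_residue` helper and made-up logits (not any model's actual API):

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def sample_residue(logits, temperature=0.1, rng=None):
    """Sample one amino acid from logits at the given temperature.

    Dividing logits by a low temperature sharpens the softmax, so
    sampling concentrates on top-scoring residues; temperature -> 0
    approaches greedy argmax decoding.
    """
    rng = rng or np.random.default_rng()
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled -= scaled.max()              # numerical stability
    probs = np.exp(scaled)
    probs /= probs.sum()
    return AMINO_ACIDS[rng.choice(len(AMINO_ACIDS), p=probs)]

# Hypothetical logits strongly favoring glycine at one position;
# at T=0.1 the sampler is effectively greedy here.
logits = np.zeros(20)
logits[AMINO_ACIDS.index("G")] = 5.0
seq = "".join(sample_residue(logits, temperature=0.1) for _ in range(10))
```

At T=0.1 a 5-logit margin becomes a 50-unit gap after scaling, which is why the benchmark's low temperature yields near-deterministic, high-confidence sequences.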