ProtCLIP: Function-Informed Protein Multi-Modal Learning

Authors: Hanjing Zhou, Mingze Yin, Wei Wu, Mingyang Li, Kun Fu, Jintai Chen, Jian Wu, Zheng Wang

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | On 22 different protein benchmarks within 5 types, including protein functionality classification, mutation effect prediction, cross-modal transformation, semantic similarity inference and protein-protein interaction prediction, our ProtCLIP consistently achieves SOTA performance, with remarkable improvements of 75% on average in five cross-modal transformation benchmarks, 59.9% in GO-CC and 39.7% in GO-BP protein function prediction. The experimental results verify the extraordinary potential of ProtCLIP serving as the protein multi-modality foundation model.
Researcher Affiliation | Collaboration | (1) College of Computer Science and Technology, Zhejiang University; (2) State Key Laboratory of Transvascular Implantation Devices of The Second Affiliated Hospital, Zhejiang University; (3) Alibaba Cloud Computing; (4) School of Artificial Intelligence and Data Science, University of Science and Technology of China; (5) AI Thrust, Information Hub, HKUST(Guangzhou); (6) Zhejiang Key Laboratory of Medical Imaging Artificial Intelligence
Pseudocode | No | The paper describes methods in prose and illustrates the model architecture in Figure 3, but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide an explicit statement about releasing code, a link to a code repository, or mention of code in supplementary materials.
Open Datasets | Yes | Our pre-training data is sourced from Swiss-Prot and TrEMBL (Bairoch and Apweiler 2000)... We utilize the β-lactamase (β-lac) landscape from PEER (Xu et al. 2022), Fluorescence (Flu) and Stability (Sta) landscapes from TAPE (Rao et al. 2019), and AAV and Thermostability (Thermo) landscapes from FLIP (Dallago et al. 2021)... Following (Wang et al. 2024), we leverage the raw knowledge graph (KG) data... F1 score is reported on the SHS27K (Chen et al. 2019), SHS148K (Chen et al. 2019) and STRING (Lv et al. 2021) datasets for evaluation.
Dataset Splits | Yes | Following (Wang et al. 2024), we leverage the raw knowledge graph (KG) data and undertake some preprocessing steps, with a training/validation/test split of 80%/10%/10%.
Hardware Specification | Yes | We build our codes upon the PyTorch framework and conduct experiments on 64 Tesla V100 GPUs with 10,000 GPU hours.
Software Dependencies | No | The paper states 'We build our codes upon the PyTorch framework' but does not specify a version number for PyTorch or any other software dependencies with their versions.
Experiment Setup | Yes | An Adam optimizer is used (learning rate: 1.0 × 10⁻⁵, weight decay: 0) to train the model. The batch sizes are 2048 and 512 for pre-training and downstream experiments, respectively. Within the function-informed pre-training paradigm, we set hyper-parameters θ = 0.3, λ1 = 0.7, λ2 = 0.3.
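Since no code is released, the reported configuration can only be sketched. The snippet below is a minimal, hedged illustration of two reproducible pieces of the setup: the 80%/10%/10% KG train/validation/test split and the loss-weighting hyper-parameters θ, λ1, λ2. All function names and the weighted-sum form of the loss are assumptions, not taken from the paper's (unreleased) code.

```python
import random

def split_dataset(items, seed=0, ratios=(0.8, 0.1, 0.1)):
    """Shuffle and split items into train/validation/test by the given ratios
    (80%/10%/10% as reported for the KG data). Split style is assumed."""
    items = list(items)
    random.Random(seed).shuffle(items)  # fixed seed for reproducibility
    n = len(items)
    n_train = int(ratios[0] * n)
    n_val = int(ratios[1] * n)
    return items[:n_train], items[n_train:n_train + n_val], items[n_train + n_val:]

# Reported pre-training hyper-parameters (values from the paper).
THETA = 0.3                  # θ in the function-informed paradigm
LAMBDA1, LAMBDA2 = 0.7, 0.3  # λ1, λ2 loss-combination weights

def combined_loss(loss1, loss2):
    """Assumed weighted sum of two pre-training objectives using λ1, λ2;
    the paper does not spell out the exact combination in code form."""
    return LAMBDA1 * loss1 + LAMBDA2 * loss2

train, val, test = split_dataset(range(1000))
# With 1000 items this yields splits of 800, 100 and 100 examples.
```

The fixed seed makes the illustrative split deterministic; the actual preprocessing in (Wang et al. 2024) may differ.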