ProtCLIP: Function-Informed Protein Multi-Modal Learning

Authors: Hanjing Zhou, Mingze Yin, Wei Wu, Mingyang Li, Kun Fu, Jintai Chen, Jian Wu, Zheng Wang

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | On 22 different protein benchmarks within 5 types, including protein functionality classification, mutation effect prediction, cross-modal transformation, semantic similarity inference and protein-protein interaction prediction, our ProtCLIP consistently achieves SOTA performance, with remarkable improvements of 75% on average in five cross-modal transformation benchmarks, 59.9% in GO-CC and 39.7% in GO-BP protein function prediction. The experimental results verify the extraordinary potential of ProtCLIP serving as the protein multi-modality foundation model.
Researcher Affiliation | Collaboration | (1) College of Computer Science and Technology, Zhejiang University; (2) State Key Laboratory of Transvascular Implantation Devices of The Second Affiliated Hospital, Zhejiang University; (3) Alibaba Cloud Computing; (4) School of Artificial Intelligence and Data Science, University of Science and Technology of China; (5) AI Thrust, Information Hub, HKUST(Guangzhou); (6) Zhejiang Key Laboratory of Medical Imaging Artificial Intelligence
Pseudocode | No | The paper describes methods in prose and illustrates the model architecture in Figure 3, but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide an explicit statement about releasing code, a link to a code repository, or mention of code in supplementary materials.
Open Datasets | Yes | Our pre-training data is sourced from Swiss-Prot and TrEMBL (Bairoch and Apweiler 2000)... We utilize the β-lactamase (β-lac) landscape from PEER (Xu et al. 2022), Fluorescence (Flu) and Stability (Sta) landscapes from TAPE (Rao et al. 2019), and AAV and Thermostability (Thermo) landscapes from FLIP (Dallago et al. 2021)... Following (Wang et al. 2024), we leverage the raw knowledge graph (KG) data... F1 score is reported on the SHS27K (Chen et al. 2019), SHS148K (Chen et al. 2019) and STRING (Lv et al. 2021) datasets for evaluation.
Dataset Splits | Yes | Following (Wang et al. 2024), we leverage the raw knowledge graph (KG) data and undertake some preprocessing steps, with a training/validation/test split of 80%/10%/10%.
Hardware Specification | Yes | We build our codes upon the PyTorch framework and conduct experiments on 64 Tesla V100 GPUs with 10,000 GPU hours.
Software Dependencies | No | The paper states 'We build our codes upon the PyTorch framework' but does not specify a version number for PyTorch or any other software dependencies with their versions.
Experiment Setup | Yes | An Adam optimizer is used (learning rate: 1.0 × 10⁻⁵, weight decay: 0) to train the model. The batch sizes are 2048 and 512 for pre-training and downstream experiments, respectively. Within the function-informed pre-training paradigm, we set hyper-parameters θ = 0.3, λ1 = 0.7, λ2 = 0.3.
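Since no code is released, the reported configuration can only be sketched. The snippet below is a minimal, hedged illustration of two reproducible pieces of the setup: the 80%/10%/10% KG train/validation/test split and the loss-weighting hyper-parameters θ, λ1, λ2. All function names and the weighted-sum form of the loss are assumptions, not taken from the paper's (unreleased) code.

```python
import random

def split_dataset(items, seed=0, ratios=(0.8, 0.1, 0.1)):
    """Shuffle and split items into train/validation/test by the given ratios
    (80%/10%/10% as reported for the KG data). Split style is assumed."""
    items = list(items)
    random.Random(seed).shuffle(items)  # fixed seed for reproducibility
    n = len(items)
    n_train = int(ratios[0] * n)
    n_val = int(ratios[1] * n)
    return items[:n_train], items[n_train:n_train + n_val], items[n_train + n_val:]

# Reported pre-training hyper-parameters (values from the paper).
THETA = 0.3                  # θ in the function-informed paradigm
LAMBDA1, LAMBDA2 = 0.7, 0.3  # λ1, λ2 loss-combination weights

def combined_loss(loss1, loss2):
    """Assumed weighted sum of two pre-training objectives using λ1, λ2;
    the paper does not spell out the exact combination in code form."""
    return LAMBDA1 * loss1 + LAMBDA2 * loss2

train, val, test = split_dataset(range(1000))
# With 1000 items this yields splits of 800, 100 and 100 examples.
```

The fixed seed makes the illustrative split deterministic; the actual preprocessing in (Wang et al. 2024) may differ.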