ProtCLIP: Function-Informed Protein Multi-Modal Learning
Authors: Hanjing Zhou, Mingze Yin, Wei Wu, Mingyang Li, Kun Fu, Jintai Chen, Jian Wu, Zheng Wang
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | On 22 different protein benchmarks within 5 types, including protein functionality classification, mutation effect prediction, cross-modal transformation, semantic similarity inference and protein-protein interaction prediction, our ProtCLIP consistently achieves SOTA performance, with remarkable improvements of 75% on average in five cross-modal transformation benchmarks, 59.9% in GO-CC and 39.7% in GO-BP protein function prediction. The experimental results verify the extraordinary potential of ProtCLIP serving as the protein multi-modality foundation model. |
| Researcher Affiliation | Collaboration | 1College of Computer Science and Technology, Zhejiang University, 2State Key Laboratory of Transvascular Implantation Devices of The Second Affiliated Hospital, Zhejiang University, 3Alibaba Cloud Computing, 4School of Artificial Intelligence and Data Science, University of Science and Technology of China, 5AI Thrust, Information Hub, HKUST(Guangzhou), 6Zhejiang Key Laboratory of Medical Imaging Artificial Intelligence |
| Pseudocode | No | The paper describes methods in prose and illustrates the model architecture in Figure 3, but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide an explicit statement about releasing code, a link to a code repository, or mention code in supplementary materials. |
| Open Datasets | Yes | Our pre-training data is sourced from Swiss-Prot and TrEMBL (Bairoch and Apweiler 2000)... We utilize β-lactamase (β-lac) landscape from PEER (Xu et al. 2022), Fluorescence (Flu) and Stability (Sta) landscapes from TAPE (Rao et al. 2019), and AAV and Thermostability (Thermo) landscapes from FLIP (Dallago et al. 2021)... Following (Wang et al. 2024), we leverage the raw knowledge graph (KG) data... F1 score is reported on SHS27K (Chen et al. 2019), SHS148K (Chen et al. 2019) and STRING (Lv et al. 2021) datasets for evaluation. |
| Dataset Splits | Yes | Following (Wang et al. 2024), we leverage the raw knowledge graph (KG) data and undertake some preprocessing steps, with the training/validation/test split of 80%/10%/10%. |
| Hardware Specification | Yes | We build our codes upon the PyTorch framework and conduct experiments on 64 Tesla V100 GPUs with 10,000 GPU hours. |
| Software Dependencies | No | The paper states 'We build our codes upon the PyTorch framework' but does not specify a version number for PyTorch or any other software dependencies with their versions. |
| Experiment Setup | Yes | An Adam optimizer is used (learning rate: 1.0 × 10⁻⁵, weight decay: 0) to train the model. The batch size is 2048 and 512 for pre-training and downstream experiments. Within the function-informed pre-training paradigm, we set hyper-parameters θ = 0.3, λ1 = 0.7, λ2 = 0.3. |
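The reported experiment setup can be collected into a minimal configuration sketch. This is an illustrative reconstruction only: the paper does not release code, so the config keys and the assumption that λ1/λ2 linearly weight two pre-training objectives are hypothetical, not the authors' implementation.

```python
# Hyper-parameters as reported in the ProtCLIP paper (AAAI 2025).
# Key names are illustrative; the paper does not publish a config schema.
config = {
    "optimizer": "Adam",
    "learning_rate": 1.0e-5,
    "weight_decay": 0.0,
    "batch_size_pretrain": 2048,
    "batch_size_downstream": 512,
    "theta": 0.3,      # threshold hyper-parameter θ
    "lambda1": 0.7,    # weight λ1
    "lambda2": 0.3,    # weight λ2
}

def combined_loss(loss_a: float, loss_b: float, cfg: dict = config) -> float:
    """Weighted sum of two training objectives.

    Assumption (not stated explicitly in the table above): λ1 and λ2
    combine two loss terms linearly, a common pattern for multi-objective
    pre-training.
    """
    return cfg["lambda1"] * loss_a + cfg["lambda2"] * loss_b
```

With λ1 = 0.7 and λ2 = 0.3 the weights sum to 1, so the combined loss stays on the same scale as its components.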