Enhancing Multimodal Protein Function Prediction Through Dual-Branch Dynamic Selection with Reconstructive Pre-Training

Authors: Xiaoling Luo, Peng Chen, Chengliang Liu, Xiaopeng Jin, Jie Wen, Yumeng Liu, Junsong Wang

IJCAI 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Our proposed DSRPGO model improves significantly in BPO, MFO, and CCO on human datasets, thereby outperforming other benchmark models. 3 Experiments In this section, we present the experimental setup, including the datasets, baseline models, training details, and evaluation metrics. Then we provide an analysis of the experimental results, supported by ablation studies and Davies-Bouldin scores to validate the effectiveness of the model.
Researcher Affiliation Academia 1College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, China 2College of Applied Technology, Shenzhen University, Shenzhen, China 3Laboratory for Artificial Intelligence in Design, Hong Kong 4College of Big Data and Internet, Shenzhen Technology University, Shenzhen, China 5College of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, China
Pseudocode Yes Algorithm 1: Dynamic Selection Module Procedure
Input: Protein vector X_dsm, threshold t
Output: Fusion feature after DSM
1: Initialize expert weights W ← 0^N.
2: Compute expert confidence coefficients p̂ ← Softmax(MLP(X_dsm)).
3: Select active experts S ← {E_i | p̂_i ≥ t}.
4: for each expert E_i in S do
5:   Normalize p̂ to obtain weight W_i ← p̂_i / Σ_{j∈S} p̂_j
6: end for
7: return DSM(X_dsm) ← Concat(W_i · E_i(X_dsm))
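The algorithm above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the gating MLP is replaced by a single linear layer (`gate_w`), and the expert networks are stand-in callables.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def dynamic_selection(x, experts, gate_w, threshold=0.2):
    """Sketch of the Dynamic Selection Module (Algorithm 1).

    x:        protein feature vector X_dsm
    experts:  list of expert callables E_i
    gate_w:   linear gating weights (stand-in for the paper's MLP)
    """
    # Step 2: expert confidence coefficients p_hat = Softmax(MLP(x))
    p_hat = softmax(gate_w @ x)
    # Step 3: keep only experts whose confidence reaches the threshold
    active = [i for i, p in enumerate(p_hat) if p >= threshold]
    # Step 5: renormalize the selected confidences to obtain weights W_i
    total = sum(p_hat[i] for i in active)
    weights = {i: p_hat[i] / total for i in active}
    # Step 7: fused feature = concatenation of weighted expert outputs
    return np.concatenate([weights[i] * experts[i](x) for i in active])
```

Because the kept weights are renormalized, they sum to one regardless of how many experts pass the threshold, so the fused feature's scale stays stable as experts drop in and out.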
Open Source Code Yes The code and supplementary materials have been open-sourced1. 1https://github.com/kioedru/DSRPGO
Open Datasets Yes We construct our dataset based on CFAGO [Wu et al., 2023]. PPI data comes from the STRING [Szklarczyk et al., 2023] database (v11.5), and protein sequences, subcellular localization, and domain data are from the UniProt [Consortium, 2022] database (v3.5.175). A total of 19,385 proteins are used for pretraining. For fine-tuning, we collect protein function annotations from the Gene Ontology [Aleksander et al., 2023] database (v2022-01-13).
Dataset Splits Yes The fine-tuning datasets for each GO branch, split by two time points, include BPO: 3,197 training, 304 validation, 182 testing proteins (45 GO terms); MFO: 2,747 training, 503 validation, 719 testing proteins (38 GO terms); and CCO: 5,263 training, 577 validation, 119 testing proteins (35 GO terms).
Hardware Specification Yes We conduct all experiments on NVIDIA GTX 4090.
Software Dependencies No The text does not provide specific version numbers for any software or libraries; it only mentions the AdamW optimizer.
Experiment Setup Yes We set the dropout rate to 0.1 during pre-training, and the model trains for 5000 epochs, with a learning rate of 1e-5 for the first 2500 epochs and 1e-6 for the remaining 2500 epochs. During fine-tuning, we use a dropout rate of 0.3 and train for 100 epochs with the AdamW optimizer. The learning rate is set to 1e-3 for the first 50 epochs and reduced to 1e-4 for the remaining 50 epochs.
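The two-stage, piecewise-constant learning-rate schedules described above can be expressed as simple helper functions. This is an illustrative sketch of the reported hyperparameters, not code from the paper's repository:

```python
def pretrain_lr(epoch, total_epochs=5000):
    """Pre-training schedule reported in the paper:
    1e-5 for the first 2500 epochs, 1e-6 for the remaining 2500."""
    return 1e-5 if epoch < total_epochs // 2 else 1e-6

def finetune_lr(epoch, total_epochs=100):
    """Fine-tuning schedule reported in the paper:
    1e-3 for the first 50 epochs, 1e-4 for the remaining 50."""
    return 1e-3 if epoch < total_epochs // 2 else 1e-4
```

In a PyTorch training loop, either function could drive the optimizer by assigning its return value to each parameter group's `lr` at the start of every epoch.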