Retrieval-Augmented Language Model for Knowledge-aware Protein Encoding
Authors: Jiasheng Zhang, Delvin Ce Zhang, Shuang Liang, Zhengpin Li, Zhitao Ying, Jie Shao
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate Kara across six downstream tasks, such as amino acid contact prediction, homology detection, and stability prediction. Our analysis includes hyper-parameter sensitivity, component-wise ablations, detailed examinations of the generalization ability to unseen knowledge, and the analysis of model robustness to PKG incompleteness. Detailed task descriptions are in Appendix D. Experimental settings and implementation details are in Appendix E. Results are averaged over 3 independent runs. |
| Researcher Affiliation | Academia | Jiasheng Zhang 1 Delvin Ce Zhang 2 Shuang Liang 1 Zhengpin Li 3 Rex Ying 4 Jie Shao 1 1University of Electronic Science and Technology of China 2The Pennsylvania State University 3Fudan University 4Yale University. Correspondence to: Jie Shao <EMAIL>. |
| Pseudocode | No | The paper describes the methodology in natural language and using diagrams in Section 3, but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper refers to the official code released by Zhou et al. (2023) for implementing downstream task experiments and provides links to third-party models (ProtBert, PubMedBERT) used, but does not explicitly state that the source code for Kara, the methodology described in this paper, is publicly available or provide a direct link to its repository. |
| Open Datasets | Yes | We train the proposed Kara using the ProteinKG25 knowledge graph (Zhang et al., 2022a)... The raw data of ProteinKG25 can be found in https://www.zjukg.org/project/ProteinKG25/. ... Following Zhou et al. (2023), we use data that comes from ProteinNet (AlQuraishi, 2019)... Experiments are done on three widely-used datasets SHS27K (Chen et al., 2019), SHS148K (Chen et al., 2019), and STRING (Lv et al., 2021)... We follow the datasets and experimental settings of Hou et al. (2018)... As in Rocklin et al. (2017), we use Spearman's rank correlation scores for evaluation. ... The SKEMPI dataset (Moal & Fernández-Recio, 2012) is used. |
| Dataset Splits | Yes | Following Zhou et al. (2023), we use data that comes from ProteinNet (AlQuraishi, 2019) and report precision on the ProteinNet CASP12 test set... since the train/valid/test splits of the SHS27K, SHS148K, and STRING datasets are not provided, we use the official code released by Lv et al. (2021) to split each dataset with three different random seeds, and the average performance on each dataset is reported. ... We follow the previous works and use data from Hou et al. (2018), holding out entire evolutionary groups from the training set... We use the data provided by Rocklin et al. (2017), where the training set includes proteins from four rounds of experimental design, while the test set contains proteins that are Hamming distance-1 neighbors of the top candidates. ... Results are reported as mean squared error under 10-fold cross-validation. ... First, we randomly divide the triples (i.e., (protein, relation, go)) into training and testing sets in an 8:2 ratio. |
| Hardware Specification | Yes | All the experiments are conducted on NVIDIA A40 with 48 GB memory. |
| Software Dependencies | No | Our model is implemented with Python and we refer to the official code released by Zhou et al. (2023) to implement the downstream task experiments. Beyond Python itself, the paper does not list specific libraries or version numbers. |
| Experiment Setup | Yes | In the pre-training stage, ...maximum token length is set as 1024 for proteins and 512 for text descriptions. ... The margin γ is set as 5 and the number of negative samples is set as 2. We set the batch size to 4 with the maximum number of update steps to 10,000, and the gradient accumulation step to 16. The learning rate is set as 1e-6 and we use AdamW (Loshchilov & Hutter, 2019) for optimization. The weight decay is set as 1e-2. ... In the knowledge retriever, we set the sampling number of neighbors during the candidate embedding generation as 100. ... The number of training epochs is set as 500 with the batch size as 100, and we use the early stopping strategy with a patience of 5. The learning rate is set as 1e-3 and the negative sampling number is set as 20. The margin γ is also set as 5. ... In the fine-tuning stage, ... Different downstream tasks require various fine-tuning hyper-parameters and we summarize them in Table 12. Additionally, we follow the implementations in GNN-PPI (Lv et al., 2021) for PPI prediction, where the number of epochs is 600 and batch size is 2048. The learning rate is set as 1e-3 for the SHS27K dataset and 1e-4 for the SHS148K and STRING datasets. |
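The margin and negative-sampling hyper-parameters quoted above (γ = 5, two negatives per positive) can be illustrated with the standard hinge-style ranking loss used by many knowledge-graph models. This is a sketch under that assumption, not Kara's actual objective: the paper defines its loss in Section 3, and the scoring values below are hypothetical.

```python
# Sketch of a margin-based ranking loss with the quoted hyper-parameters:
# margin gamma = 5 and two negative samples per positive triple.
# The hinge form max(0, gamma + s_neg - s_pos) is assumed, not taken
# from the paper; scores are plain floats for illustration only.

def margin_ranking_loss(pos_score, neg_scores, gamma=5.0):
    """Average hinge loss over the negative samples for one positive triple."""
    return sum(max(0.0, gamma + s_neg - pos_score) for s_neg in neg_scores) / len(neg_scores)

# Example: one positive triple scored 3.0 against two negatives.
# Terms: max(0, 5 + 1 - 3) = 3.0 and max(0, 5 - 2 - 3) = 0.0, averaged.
loss = margin_ranking_loss(3.0, [1.0, -2.0])  # -> 1.5
```

A well-separated positive (score far above every negative plus the margin) contributes zero loss, which is what drives the embeddings apart during pre-training.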
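The Dataset Splits row mentions a seeded 8:2 train/test division of (protein, relation, go) triples, repeated with different random seeds. A minimal sketch of such a seeded split, with hypothetical placeholder triples standing in for the real ProteinKG25 data:

```python
# Hedged sketch of an 8:2 triple split, seeded so each of the independent
# runs reported in the table is reproducible. The triples below are
# hypothetical placeholders, not actual ProteinKG25 entries.
import random

def split_triples(triples, train_ratio=0.8, seed=0):
    """Shuffle (protein, relation, go) triples and split them train/test."""
    rng = random.Random(seed)       # per-seed generator, no global state
    shuffled = triples[:]           # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

triples = [(f"P{i}", "enables", f"GO:{i:07d}") for i in range(10)]
train, test = split_triples(triples, seed=42)  # 8 train, 2 test
```

Running the same split with three seeds (e.g. 0, 1, 2) and averaging the resulting metrics mirrors the three-seed protocol the report describes for SHS27K, SHS148K, and STRING.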