Knowledge Graph Finetuning Enhances Knowledge Manipulation in Large Language Models

Authors: Hanzhu Chen, Xu Shen, Jie Wang, Zehao Wang, Qitan Lv, Junjie He, Rong Wu, Feng Wu, Jieping Ye

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Extensive experiments on fifteen different domains and six different languages demonstrate the effectiveness of KG-SFT, leading to an accuracy improvement of up to 18.1% and an average of 8.7% in low-data scenarios.
Researcher Affiliation Academia 1 MoE Key Laboratory of Brain-inspired Intelligent Perception and Cognition, University of Science and Technology of China; 2 Independent Researcher; 3 Zhejiang University (wurong1159@zju.edu.cn)
Pseudocode No The paper describes the BM25 algorithm and HITS algorithm in Section 3 and the components of KG-SFT (Extractor, Generator, Detector) in Section 4, but it does not present these as structured pseudocode or algorithm blocks.
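Although the paper presents no pseudocode, the HITS algorithm it cites in Section 3 is standard; a minimal sketch of hub/authority power iteration on a toy directed graph (illustrative code, not the authors' implementation) could look like:

```python
import math

def hits(adj, iters=50):
    """Run the HITS algorithm on a directed graph given as {node: [successors]}.
    Returns (hub, authority) score dicts, L2-normalised each iteration."""
    nodes = set(adj) | {v for vs in adj.values() for v in vs}
    hub = {n: 1.0 for n in nodes}
    for _ in range(iters):
        # Authority score: sum of hub scores of in-neighbours.
        auth = {n: 0.0 for n in nodes}
        for u, vs in adj.items():
            for v in vs:
                auth[v] += hub[u]
        norm = math.sqrt(sum(x * x for x in auth.values())) or 1.0
        auth = {n: x / norm for n, x in auth.items()}
        # Hub score: sum of authority scores of out-neighbours.
        hub = {u: sum(auth[v] for v in adj.get(u, [])) for u in nodes}
        norm = math.sqrt(sum(x * x for x in hub.values())) or 1.0
        hub = {n: x / norm for n, x in hub.items()}
    return hub, auth
```

For example, in the graph `{"a": ["c"], "b": ["c"], "c": ["d"]}`, node `c` receives the highest authority score because two hubs point to it.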
Open Source Code No Moreover, we are committed to providing the source code of our approach, if accepted.
Open Datasets Yes We choose the medical field as a canonical low-data and knowledge-intensive field... Therefore, our evaluation task adopts multiple-choice questions and selects medical examination questions in six languages as the evaluation data. Please refer to Appendix A.1 for the statistics of our datasets. Table 8 presents the statistical results for the medical multiple-choice question benchmarks in six languages:

Dataset | Language | Source | Train | Test
MedQA | English | United States Medical Licensing Examination | 10178 | 1273
MedQA | Chinese | United States Medical Licensing Examination | 27400 | 3426
IgakuQA | Japanese | Japan's medical licensure exams (2018-2022) | 1590 | 199
RuMedDaNet | Russian | Russian medical judgment question dataset | 1052 | 256
FrenchMedMCQA | French | Professional exams for the French Pharmacy degree | 2171 | 622
HEAD-QA | Spanish | Exams for positions in the Spanish healthcare system | 2657 | 2742
Dataset Splits Yes Table 8 presents the statistical results for the medical multiple-choice question benchmarks in six languages:

Dataset | Language | Source | Train | Test
MedQA | English | United States Medical Licensing Examination | 10178 | 1273
MedQA | Chinese | United States Medical Licensing Examination | 27400 | 3426
IgakuQA | Japanese | Japan's medical licensure exams (2018-2022) | 1590 | 199
RuMedDaNet | Russian | Russian medical judgment question dataset | 1052 | 256
FrenchMedMCQA | French | Professional exams for the French Pharmacy degree | 2171 | 622
HEAD-QA | Spanish | Exams for positions in the Spanish healthcare system | 2657 | 2742
Hardware Specification Yes Our experiments were performed using 4 A100 GPUs (80GB) over 5 epochs with the LLaMA2-7B model.
Software Dependencies No The paper mentions using DeepSpeed and the BF16 data type, and refers to LLaMA-Factory for model training, but it does not provide specific version numbers for these or other key software components.
Experiment Setup Yes In the fine-tuning phase, our optimization objective is minimizing the loss between the generated text and the target text. We set the maximum context length to 2048, padding each batch to match the longest sequence in that batch. We use the AdamW optimizer with the following hyper-parameters: β1 = 0.95, β2 = 0.9. For full-model fine-tuning, we utilized DeepSpeed, the BF16 data type, and gradient checkpointing. We set the global batch size to 64 and the warmup ratio to 0.03. For vanilla SFT data without explanations, we set a learning rate of 1e-6. In the case of the enhanced KG-SFT data with explanations, we set a learning rate of 5e-6. Finally, the models are trained on four A100 GPUs for 5 epochs.
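The quoted setup can be summarised in a short sketch. Only the constants below are taken from the paper; the schedule shape (linear warmup then constant) and the helper function are assumptions, since the paper states just the warmup ratio and learning rates:

```python
# Hyper-parameters quoted from the paper's fine-tuning setup.
MAX_CONTEXT_LEN = 2048
GLOBAL_BATCH_SIZE = 64
WARMUP_RATIO = 0.03
ADAM_BETAS = (0.95, 0.9)      # beta1, beta2 as reported
LR_VANILLA_SFT = 1e-6         # SFT data without explanations
LR_KG_SFT = 5e-6              # KG-SFT data with explanations

def lr_at_step(step, total_steps, base_lr, warmup_ratio=WARMUP_RATIO):
    """Linear warmup to base_lr, then constant. The paper gives only the
    warmup ratio; the linear-then-constant shape is an assumption here."""
    warmup_steps = max(1, int(warmup_ratio * total_steps))
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr
```

With 1000 optimizer steps, the first 30 steps (3%) would ramp the learning rate up to its target value before it stays flat.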