KGARevion: An AI Agent for Knowledge-Intensive Biomedical QA

Authors: Xiaorui Su, Yibo Wang, Shanghua Gao, Xiaolong Liu, Valentina Giunchiglia, Djork-Arné Clevert, Marinka Zitnik

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Evaluations on medical QA benchmarks show that KGAREVION improves accuracy by over 5.2% over 15 models in handling complex medical queries. To further assess its effectiveness, we curated three new medical QA datasets with varying levels of semantic complexity, where KGAREVION improved accuracy by 10.4%.
Researcher Affiliation | Collaboration | Xiaorui Su (1), Yibo Wang (2), Shanghua Gao (1), Xiaolong Liu (2), Valentina Giunchiglia (3), Djork-Arné Clevert (4), Marinka Zitnik (1); (1) Harvard University, (2) University of Illinois Chicago, (3) Imperial College London, (4) Pfizer
Pseudocode | No | The paper describes the KGAREVION agent's actions (Generate, Review, Revise, Answer) in narrative text and figures, but does not present any structured pseudocode or algorithm blocks.
Open Source Code | Yes | KGAREVION is available at https://github.com/mims-harvard/KGARevion.
Open Datasets | Yes | We first start with four multi-choice medical QA benchmarks (Xiong et al., 2024a) (Table 1). In addition, we introduce a new benchmark for multi-choice complex medical QA focused on differential diagnosis (DDx), named MedDDx... AfriMed-QA, a newly published QA dataset released after all baseline models in this study (Olatunji et al., 2024).
Dataset Splits | Yes | During the fine-tuning stage, we first split PrimeKG (Chandak et al., 2023) into two parts: a training set and a testing set, in a ratio of 8:2.
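The 8:2 triple split described above can be sketched as follows. This is a minimal illustration, not code from the KGARevion repository; the function name, seed, and toy triples are assumptions.

```python
import random

def split_triples(triples, train_ratio=0.8, seed=0):
    """Randomly split knowledge-graph triples (head, relation, tail)
    into a training set and a testing set at the given ratio.

    Illustrative sketch only; not taken from the KGARevion codebase.
    """
    rng = random.Random(seed)  # fixed seed for a reproducible split
    shuffled = list(triples)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]
```

With `train_ratio=0.8`, ten triples yield eight for training and two for testing, matching the 8:2 ratio quoted from the paper.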
Hardware Specification | Yes | All experiments are conducted on a machine equipped with 4 NVIDIA H100 GPUs. We use 1 NVIDIA H100 to implement baselines with small LLMs. In the fine-tuning stage, we use 4 NVIDIA H100 GPUs to fine-tune the review module.
Software Dependencies | Yes | We implement KGAREVION using Python 3.9.19, PyTorch 2.3.1, Transformers 4.43.1, and Tokenizers 0.19.1.
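Pinning the quoted versions in a requirements file would look like the sketch below; the PyPI package names are assumed from the library names, and the Python version constraint reflects the reported interpreter.

```
# Assumed pins matching the versions reported in the paper (Python 3.9.19)
torch==2.3.1
transformers==4.43.1
tokenizers==0.19.1
```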
Experiment Setup | Yes | For hyperparameter tuning, we use grid search to identify the optimal parameter combinations by evaluating the fine-tuned model's performance on the knowledge graph completion task using the testing set. Specifically, we focus on the parameter r in LoRA training and the batch size during the fine-tuning stage. The values explored for r are 16, 32, 64, 128, while the tested batch sizes bz are 128, 256, 512, 1024. The best parameters identified are r = 32, bz = 256.
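The grid search described above can be sketched as follows. The `evaluate_fn` is a stand-in for the fine-tune-and-score step (LoRA fine-tuning of the review module, then accuracy on the KG-completion test set); it is an assumption for illustration, not code from the repository.

```python
from itertools import product

# Candidate values explored in the paper's grid search.
R_VALUES = [16, 32, 64, 128]        # LoRA rank r
BATCH_SIZES = [128, 256, 512, 1024]  # fine-tuning batch size bz

def grid_search(evaluate_fn):
    """Return the (r, bz) pair that maximises evaluate_fn.

    evaluate_fn(r, bz) is assumed to fine-tune with LoRA rank r and
    batch size bz, then return a score on the KG-completion test set.
    """
    best, best_score = None, float("-inf")
    for r, bz in product(R_VALUES, BATCH_SIZES):
        score = evaluate_fn(r, bz)
        if score > best_score:
            best, best_score = (r, bz), score
    return best
```

Under the paper's reported results, such a search would return r = 32, bz = 256 as the best combination.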