KGARevion: An AI Agent for Knowledge-Intensive Biomedical QA
Authors: Xiaorui Su, Yibo Wang, Shanghua Gao, Xiaolong Liu, Valentina Giunchiglia, Djork-Arné Clevert, Marinka Zitnik
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Evaluations on medical QA benchmarks show that KGAREVION improves accuracy by over 5.2% over 15 models in handling complex medical queries. To further assess its effectiveness, we curated three new medical QA datasets with varying levels of semantic complexity, where KGAREVION improved accuracy by 10.4%. |
| Researcher Affiliation | Collaboration | Xiaorui Su1 Yibo Wang2 Shanghua Gao1 Xiaolong Liu2 Valentina Giunchiglia3 Djork-Arné Clevert4 Marinka Zitnik1 1Harvard University 2University of Illinois Chicago 3Imperial College London 4Pfizer |
| Pseudocode | No | The paper describes the KGAREVION agent's actions (Generate, Review, Revise, Answer) in narrative text and figures, but does not present any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | KGAREVION is available at https://github.com/mims-harvard/KGARevion. |
| Open Datasets | Yes | We first start with four multi-choice medical QA benchmarks (Xiong et al., 2024a) (Table 1). In addition, we introduce a new benchmark for multi-choice complex medical QA focused on differential diagnosis (DDx), named MedDDx... AfriMed-QA, a newly published QA dataset released after all baseline models in this study (Olatunji et al., 2024). |
| Dataset Splits | Yes | During the fine-tuning stage, we first split PrimeKG (Chandak et al., 2023) into two parts: a training set and a testing set, in a ratio of 8:2. |
| Hardware Specification | Yes | All experiments are conducted on a machine equipped with 4 NVIDIA H100. We use 1 NVIDIA H100 to implement baselines with small LLMs. In the fine-tuning stage, we use 4 NVIDIA H100 to fine-tune the review module. |
| Software Dependencies | Yes | We implement KGAREVION using Python 3.9.19, PyTorch 2.3.1, Transformers 4.43.1, and Tokenizers 0.19.1. |
| Experiment Setup | Yes | For hyperparameter tuning, we use grid search to identify the optimal parameter combinations by evaluating the fine-tuned model's performance on the knowledge graph completion task using the testing set. Specifically, we focus on the parameter r in LoRA training and the batch size during the fine-tuning stage. The values explored for r are 16, 32, 64, 128, while the tested batch sizes bz are 128, 256, 512, 1024. The best parameters identified are r = 32, bz = 256. |
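The grid search described in the Experiment Setup row can be sketched as below. This is a minimal illustration only: `finetune_and_evaluate` is a hypothetical stand-in for the paper's actual procedure of fine-tuning the review module with LoRA rank `r` and batch size `bz`, then scoring it on the held-out knowledge-graph-completion test set.

```python
from itertools import product

def finetune_and_evaluate(r: int, bz: int) -> float:
    # Hypothetical placeholder: a real run would LoRA-fine-tune the review
    # module and return test-set accuracy on KG completion. This toy scorer
    # simply peaks at the paper's reported best setting (r=32, bz=256).
    return 1.0 / (abs(r - 32) + abs(bz - 256) + 1)

def grid_search(ranks, batch_sizes):
    """Exhaustively try every (r, bz) pair and keep the best-scoring one."""
    best_params, best_score = None, float("-inf")
    for r, bz in product(ranks, batch_sizes):
        score = finetune_and_evaluate(r, bz)
        if score > best_score:
            best_params, best_score = (r, bz), score
    return best_params

# Values explored in the paper.
best = grid_search(ranks=[16, 32, 64, 128], batch_sizes=[128, 256, 512, 1024])
print(best)  # (32, 256) under the toy scorer above
```

With 4 ranks and 4 batch sizes this is 16 fine-tuning runs, which is why the paper restricts the grid to these two parameters rather than searching jointly over learning rate or epochs as well.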