Reliable and Diverse Evaluation of LLM Medical Knowledge Mastery

Authors: Yuxuan Zhou, Xien Liu, Chen Ning, Xiao Zhang, Ji Wu

ICLR 2025

Reproducibility (Variable / Result / LLM Response)
Research Type Experimental Here, we use our proposed framework to systematically investigate the mastery of medical factual knowledge of 12 well-known LLMs, based on two knowledge bases that are crucial for clinical diagnosis and treatment. The evaluation results illustrate that current LLMs still exhibit significant deficiencies in fully mastering medical knowledge, despite achieving considerable success on some famous public benchmarks. These new findings provide valuable insights for developing medical-specific LLMs, highlighting that current LLMs urgently need to strengthen their comprehensive and in-depth mastery of medical knowledge before being applied to real-world medical scenarios.
Researcher Affiliation Academia 1Department of Electronic Engineering, Tsinghua University, Beijing, China; 2College of AI, Tsinghua University, Beijing, China; 3BNRist, Beijing, China
Pseudocode No No explicit pseudocode or algorithm blocks are present in the paper. The methodology is described through text and architectural diagrams like Figure 2 and Figure 3.
Open Source Code Yes We release the codes and datasets to facilitate future study2. 2https://github.com/THUMLP/PretexEval
Open Datasets Yes We release the codes and datasets to facilitate future study2. 2https://github.com/THUMLP/PretexEval
Dataset Splits Yes Considering computational costs and dataset size, we select a subset from each dataset for evaluation. Specifically, we randomly select a single entity from the corresponding tail entities for each pair of a head entity and a relation. This approach aims to reduce the evaluation scale while maximizing the diversity of the evaluated knowledge. ... The sampled MedLAMA dataset includes 1,000 positive triplets and 1,000 negative triplets for each relation, while the detailed statistics for DiseK are presented in Table 6.
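The sampling step quoted above (one randomly chosen tail entity per head-entity/relation pair) can be sketched as follows. This is a hypothetical illustration, not the authors' released code; the function name and triplet representation are assumptions.

```python
import random
from collections import defaultdict

def sample_one_tail_per_pair(triplets, seed=0):
    """Keep one random tail entity per (head, relation) pair.

    triplets: iterable of (head, relation, tail) tuples.
    Shrinks the evaluation set while preserving the diversity
    of (head, relation) knowledge points, as described in the paper.
    """
    rng = random.Random(seed)
    grouped = defaultdict(list)
    for head, relation, tail in triplets:
        grouped[(head, relation)].append(tail)
    return [(h, r, rng.choice(tails)) for (h, r), tails in grouped.items()]

# Toy example: two tails for the same (head, relation) pair collapse to one.
triplets = [
    ("aspirin", "treats", "headache"),
    ("aspirin", "treats", "fever"),
    ("ibuprofen", "treats", "pain"),
]
sampled = sample_one_tail_per_pair(triplets)
assert len(sampled) == 2  # one triplet per unique (head, relation) pair
```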
Hardware Specification No No specific hardware details (like GPU models, CPU models, or memory) are provided for running the experiments or finetuning models. The paper mentions evaluating LLMs and finetuning Llama3-8B but omits hardware specifications.
Software Dependencies No The paper mentions utilizing 'Llama3-70B-Instruct (AI@Meta, 2024)' for rephrasing and finetuning 'LLaMA3-8B' using 'LoRA finetuning (Hu et al., 2022)' but does not provide specific software versions for libraries, frameworks, or programming languages used for implementation.
Experiment Setup Yes Method Setting To ensure the diversity of evaluation, we combined the three types of predicate transformation and generated m = 8 expressions (variants) for each knowledge point... For LLM evaluation, we employ the popular 5-shot in-context learning setting (Brown et al., 2020)... In the inference process, we use greedy search for most of LLMs... Appendix I: ...applying LoRA finetuning (Hu et al., 2022) as the training method. We apply a grid search on the learning rate {1e-4, 5e-5, 2e-5} and batch size {4, 8, 16} to find the best hyperparameters. We train each model for 10 epochs.
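The grid search quoted from Appendix I enumerates every combination of the stated learning rates and batch sizes. A minimal sketch, assuming the grid values quoted above and a fixed 10-epoch budget (the config dictionary keys are illustrative, not the authors' actual training code):

```python
from itertools import product

# Grids quoted from the paper's Appendix I.
learning_rates = [1e-4, 5e-5, 2e-5]
batch_sizes = [4, 8, 16]

# One LoRA finetuning run per (learning rate, batch size) combination,
# each trained for 10 epochs per the paper.
configs = [
    {"lr": lr, "batch_size": bs, "epochs": 10}
    for lr, bs in product(learning_rates, batch_sizes)
]
assert len(configs) == 9  # 3 learning rates x 3 batch sizes
```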