Reliable and Diverse Evaluation of LLM Medical Knowledge Mastery
Authors: Yuxuan Zhou, Xien Liu, Chen Ning, Xiao Zhang, Ji Wu
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Here, we use our proposed framework to systematically investigate the mastery of medical factual knowledge of 12 well-known LLMs, based on two knowledge bases that are crucial for clinical diagnosis and treatment. The evaluation results illustrate that current LLMs still exhibit significant deficiencies in fully mastering medical knowledge, despite achieving considerable success on some famous public benchmarks. These new findings provide valuable insights for developing medical-specific LLMs, highlighting that current LLMs urgently need to strengthen their comprehensive and in-depth mastery of medical knowledge before being applied to real-world medical scenarios. |
| Researcher Affiliation | Academia | 1Department of Electronic Engineering, Tsinghua University, Beijing, China 2College of AI, Tsinghua University, Beijing, China 3BNRist, Beijing, China |
| Pseudocode | No | No explicit pseudocode or algorithm blocks are present in the paper. The methodology is described through text and architectural diagrams like Figure 2 and Figure 3. |
| Open Source Code | Yes | We release the codes and datasets to facilitate future study2. 2https://github.com/THUMLP/PretexEval |
| Open Datasets | Yes | We release the codes and datasets to facilitate future study2. 2https://github.com/THUMLP/PretexEval |
| Dataset Splits | Yes | Considering computational costs and dataset size, we select a subset from each dataset for evaluation. Specifically, we randomly select a single entity from the corresponding tail entities for each pair of a head entity and a relation. This approach aims to reduce the evaluation scale while maximizing the diversity of the evaluated knowledge. ... The sampled MedLAMA dataset includes 1,000 positive triplets and 1,000 negative triplets for each relation, while the detailed statistics for DiseK are presented in Table 6. |
| Hardware Specification | No | No specific hardware details (like GPU models, CPU models, or memory) are provided for running the experiments or finetuning models. The paper mentions evaluating LLMs and finetuning Llama3-8B but omits hardware specifications. |
| Software Dependencies | No | The paper mentions utilizing 'Llama3-70B-Instruct (AI@Meta, 2024)' for rephrasing and finetuning 'LLaMA3-8B' using 'LoRA finetuning (Hu et al., 2022)' but does not provide specific software versions for libraries, frameworks, or programming languages used for implementation. |
| Experiment Setup | Yes | Method Setting To ensure the diversity of evaluation, we combined the three types of predicate transformation and generated m = 8 expressions (variants) for each knowledge point... For LLM evaluation, we employ the popular 5-shot in-context learning setting (Brown et al., 2020)... In the inference process, we use greedy search for most of LLMs... Appendix I: ...applying LoRA finetuning (Hu et al., 2022) as the training method. We apply a grid search on the learning rate {1e-4, 5e-5, 2e-5} and batch size {4, 8, 16} to find the best hyperparameters. We train each model for 10 epochs. |
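The grid search reported in Appendix I can be sketched as follows. This is a minimal illustration, not the authors' code: `train_and_evaluate` is a hypothetical placeholder standing in for LoRA-finetuning Llama3-8B and scoring it on a dev set; only the grid values (learning rate {1e-4, 5e-5, 2e-5}, batch size {4, 8, 16}, 10 epochs) come from the paper.

```python
# Sketch of the hyperparameter grid search described in Appendix I.
# Grid values are from the paper; the training/evaluation function is a
# hypothetical stand-in so the loop runs end to end.
from itertools import product

learning_rates = [1e-4, 5e-5, 2e-5]
batch_sizes = [4, 8, 16]
num_epochs = 10

def train_and_evaluate(lr: float, bs: int, epochs: int) -> float:
    """Placeholder: would LoRA-finetune the model and return dev accuracy."""
    # Dummy score so the sketch is executable; real code would train here.
    return 1.0 / (1.0 + abs(lr - 5e-5) * 1e4 + abs(bs - 8) / 8)

# Evaluate all 9 (learning rate, batch size) combinations and keep the best.
best = max(
    product(learning_rates, batch_sizes),
    key=lambda cfg: train_and_evaluate(cfg[0], cfg[1], num_epochs),
)
print(best)
```

Exhaustive search is reasonable here because the grid has only 9 configurations; each candidate would be trained for the full 10 epochs before comparison.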