Reliable and Diverse Evaluation of LLM Medical Knowledge Mastery

Authors: Yuxuan Zhou, Xien Liu, Chen Ning, Xiao Zhang, Ji Wu

ICLR 2025

Reproducibility (Variable / Result / LLM Response)
Research Type Experimental Here, we use our proposed framework to systematically investigate the mastery of medical factual knowledge of 12 well-known LLMs, based on two knowledge bases that are crucial for clinical diagnosis and treatment. The evaluation results illustrate that current LLMs still exhibit significant deficiencies in fully mastering medical knowledge, despite achieving considerable success on some famous public benchmarks. These new findings provide valuable insights for developing medical-specific LLMs, highlighting that current LLMs urgently need to strengthen their comprehensive and in-depth mastery of medical knowledge before being applied to real-world medical scenarios.
Researcher Affiliation Academia 1Department of Electronic Engineering, Tsinghua University, Beijing, China; 2College of AI, Tsinghua University, Beijing, China; 3BNRist, Beijing, China
Pseudocode No No explicit pseudocode or algorithm blocks are present in the paper. The methodology is described through text and architectural diagrams like Figure 2 and Figure 3.
Open Source Code Yes We release the codes and datasets to facilitate future study2. 2https://github.com/THUMLP/PretexEval
Open Datasets Yes We release the codes and datasets to facilitate future study2. 2https://github.com/THUMLP/PretexEval
Dataset Splits Yes Considering computational costs and dataset size, we select a subset from each dataset for evaluation. Specifically, we randomly select a single entity from the corresponding tail entities for each pair of a head entity and a relation. This approach aims to reduce the evaluation scale while maximizing the diversity of the evaluated knowledge. ... The sampled MedLAMA dataset includes 1,000 positive triplets and 1,000 negative triplets for each relation, while the detailed statistics for DiseK are presented in Table 6.
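The sampling step quoted above (one randomly chosen tail entity per head-entity/relation pair) can be sketched as follows. This is a hypothetical illustration, not the authors' released code; the function name and triplet representation are assumptions.

```python
import random
from collections import defaultdict

def sample_one_tail_per_pair(triplets, seed=0):
    """Keep one random tail entity per (head, relation) pair.

    triplets: iterable of (head, relation, tail) tuples.
    Shrinks the evaluation set while preserving the diversity
    of (head, relation) knowledge points, as described in the paper.
    """
    rng = random.Random(seed)
    grouped = defaultdict(list)
    for head, relation, tail in triplets:
        grouped[(head, relation)].append(tail)
    return [(h, r, rng.choice(tails)) for (h, r), tails in grouped.items()]

# Toy example: two tails for the same (head, relation) pair collapse to one.
triplets = [
    ("aspirin", "treats", "headache"),
    ("aspirin", "treats", "fever"),
    ("ibuprofen", "treats", "pain"),
]
sampled = sample_one_tail_per_pair(triplets)
assert len(sampled) == 2  # one triplet per unique (head, relation) pair
```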
Hardware Specification No No specific hardware details (like GPU models, CPU models, or memory) are provided for running the experiments or finetuning models. The paper mentions evaluating LLMs and finetuning Llama3-8B but omits hardware specifications.
Software Dependencies No The paper mentions utilizing 'Llama3-70B-Instruct (AI@Meta, 2024)' for rephrasing and finetuning 'LLaMA3-8B' using 'LoRA finetuning (Hu et al., 2022)' but does not provide specific software versions for libraries, frameworks, or programming languages used for implementation.
Experiment Setup Yes Method Setting To ensure the diversity of evaluation, we combined the three types of predicate transformation and generated m = 8 expressions (variants) for each knowledge point... For LLM evaluation, we employ the popular 5-shot in-context learning setting (Brown et al., 2020)... In the inference process, we use greedy search for most of LLMs... Appendix I: ...applying LoRA finetuning (Hu et al., 2022) as the training method. We apply a grid search on the learning rate {1e-4, 5e-5, 2e-5} and batch size {4, 8, 16} to find the best hyperparameters. We train each model for 10 epochs.
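The grid search quoted from Appendix I enumerates every combination of the stated learning rates and batch sizes. A minimal sketch, assuming the grid values quoted above and a fixed 10-epoch budget (the config dictionary keys are illustrative, not the authors' actual training code):

```python
from itertools import product

# Grids quoted from the paper's Appendix I.
learning_rates = [1e-4, 5e-5, 2e-5]
batch_sizes = [4, 8, 16]

# One LoRA finetuning run per (learning rate, batch size) combination,
# each trained for 10 epochs per the paper.
configs = [
    {"lr": lr, "batch_size": bs, "epochs": 10}
    for lr, bs in product(learning_rates, batch_sizes)
]
assert len(configs) == 9  # 3 learning rates x 3 batch sizes
```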