MedXpertQA: Benchmarking Expert-Level Medical Reasoning and Understanding

Authors: Yuxin Zuo, Shang Qu, Yifei Li, Zhang-Ren Chen, Xuekai Zhu, Ermo Hua, Kaiyan Zhang, Ning Ding, Bowen Zhou

ICML 2025

Reproducibility assessment (Variable | Result | LLM Response):

Research Type | Experimental | We evaluate 18 leading models on MedXpertQA. Moreover, medicine is deeply connected to real-world decision-making, providing a rich and representative setting for assessing reasoning abilities beyond mathematics and code. To this end, we develop a reasoning-oriented subset to facilitate the assessment of o1-like models.

Researcher Affiliation | Academia | Tsinghua University, Beijing, China; Shanghai Artificial Intelligence Laboratory, Shanghai, China. Correspondence to: Ning Ding <EMAIL>, Bowen Zhou <EMAIL>.

Pseudocode | No | The paper describes methodologies for benchmark construction and model evaluation but does not include any explicitly labeled pseudocode or algorithm blocks with structured steps.

Open Source Code | Yes | Code: https://github.com/TsinghuaC3I/MedXpertQA

Open Datasets | Yes | We introduce MedXpertQA, a highly challenging and comprehensive benchmark to evaluate expert-level medical knowledge and advanced reasoning. MedXpertQA includes 4,460 questions spanning 17 specialties and 11 body systems. It includes two subsets, Text for text evaluation and MM for multimodal evaluation.

Dataset Splits | Yes | We introduce MedXpertQA, a universal medical benchmark consisting of challenging text and multimodal subsets, Text and MM, each of which is divided into a few-shot development set with 5 questions and a test set.

Hardware Specification | No | No specific GPU/CPU models, processor types, or detailed cloud instance names used for running the experiments are mentioned.

Software Dependencies | No | The paper lists specific versions of the models evaluated (e.g., 'gpt-4o-2024-11-20', 'claude-3-5-sonnet-20241022') but does not specify ancillary software dependencies (such as programming languages, libraries, or frameworks) with version numbers for its own implementation.

Experiment Setup | Yes | We employ greedy decoding for output generation if available, ensuring result stability. For reasoning models with specific evaluation requirements, we follow their respective instructions. Appendix C.2 presents additional implementation details. We could not evaluate o1 and o3-mini on the full MedXpertQA due to costs. Instead, for both MedXpertQA Text and MedXpertQA MM, we sample 10% of questions from the Reasoning and Understanding subsets respectively. The seed is set to 42.
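The seeded 10% subsampling described in the experiment setup could be reproduced along the following lines. This is a minimal sketch, not the authors' released code: the record fields (`id`, `subset`) and the rounding rule are assumptions, and only the fraction (10%) and seed (42) come from the paper.

```python
import random

def subsample(questions, fraction=0.10, seed=42):
    """Deterministically sample a fraction of a question list with a fixed seed."""
    rng = random.Random(seed)  # seed 42 per the paper's setup
    k = max(1, round(len(questions) * fraction))
    return rng.sample(questions, k)

# Hypothetical records; the real MedXpertQA schema may differ.
bank = [{"id": i, "subset": "Reasoning" if i % 2 else "Understanding"}
        for i in range(100)]
reasoning = [q for q in bank if q["subset"] == "Reasoning"]
understanding = [q for q in bank if q["subset"] == "Understanding"]

# Sample each subset separately, as the paper samples Reasoning and
# Understanding questions respectively.
eval_set = subsample(reasoning) + subsample(understanding)
```

Seeding a local `random.Random(42)` instance (rather than the module-level generator) keeps the draw reproducible regardless of any other randomness in the evaluation pipeline.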