MedXpertQA: Benchmarking Expert-Level Medical Reasoning and Understanding

Authors: Yuxin Zuo, Shang Qu, Yifei Li, Zhang-Ren Chen, Xuekai Zhu, Ermo Hua, Kaiyan Zhang, Ning Ding, Bowen Zhou

ICML 2025

Reproducibility assessment (Variable | Result | LLM Response):

Research Type | Experimental | We evaluate 18 leading models on MedXpertQA. Moreover, medicine is deeply connected to real-world decision-making, providing a rich and representative setting for assessing reasoning abilities beyond mathematics and code. To this end, we develop a reasoning-oriented subset to facilitate the assessment of o1-like models.

Researcher Affiliation | Academia | Tsinghua University, Beijing, China; Shanghai Artificial Intelligence Laboratory, Shanghai, China. Correspondence to: Ning Ding <EMAIL>, Bowen Zhou <EMAIL>.

Pseudocode | No | The paper describes methodologies for benchmark construction and model evaluation but does not include any explicitly labeled pseudocode or algorithm blocks with structured steps.

Open Source Code | Yes | Code: https://github.com/TsinghuaC3I/MedXpertQA

Open Datasets | Yes | We introduce MedXpertQA, a highly challenging and comprehensive benchmark to evaluate expert-level medical knowledge and advanced reasoning. MedXpertQA includes 4,460 questions spanning 17 specialties and 11 body systems. It includes two subsets, Text for text evaluation and MM for multimodal evaluation.

Dataset Splits | Yes | We introduce MedXpertQA, a universal medical benchmark consisting of challenging text and multimodal subsets, Text and MM, each of which is divided into a few-shot development set with 5 questions and a test set.

Hardware Specification | No | No specific GPU/CPU models, processor types, or detailed cloud instance names used for running the experiments are mentioned.

Software Dependencies | No | The paper lists specific versions of the models evaluated (e.g., 'gpt-4o-2024-11-20', 'claude-3-5-sonnet-20241022') but does not specify ancillary software dependencies (such as programming languages, libraries, or frameworks) with version numbers for its own implementation.

Experiment Setup | Yes | We employ greedy decoding for output generation if available, ensuring result stability. For reasoning models with specific evaluation requirements, we follow their respective instructions. Appendix C.2 presents additional implementation details. We could not evaluate o1 and o3-mini on the full MedXpertQA due to costs. Instead, for both MedXpertQA Text and MedXpertQA MM, we sample 10% of questions from the Reasoning and Understanding subsets respectively. The seed is set to 42.
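The seeded 10% subsampling described in the experiment setup could be reproduced along the following lines. This is a minimal sketch, not the authors' released code: the record fields (`id`, `subset`) and the rounding rule are assumptions, and only the fraction (10%) and seed (42) come from the paper.

```python
import random

def subsample(questions, fraction=0.10, seed=42):
    """Deterministically sample a fraction of a question list with a fixed seed."""
    rng = random.Random(seed)  # seed 42 per the paper's setup
    k = max(1, round(len(questions) * fraction))
    return rng.sample(questions, k)

# Hypothetical records; the real MedXpertQA schema may differ.
bank = [{"id": i, "subset": "Reasoning" if i % 2 else "Understanding"}
        for i in range(100)]
reasoning = [q for q in bank if q["subset"] == "Reasoning"]
understanding = [q for q in bank if q["subset"] == "Understanding"]

# Sample each subset separately, as the paper samples Reasoning and
# Understanding questions respectively.
eval_set = subsample(reasoning) + subsample(understanding)
```

Seeding a local `random.Random(42)` instance (rather than the module-level generator) keeps the draw reproducible regardless of any other randomness in the evaluation pipeline.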