Visual and Domain Knowledge for Professional-level Graph-of-Thought Medical Reasoning

Authors: Rina Bao, Shilong Dong, Zhenfang Chen, Sheng He, Ellen Grant, Yangming Ou

ICML 2025

Reproducibility assessment (variable, result, and LLM response for each item):
Research Type: Experimental
LLM response: "Our evaluation of current large vision-language models (LVLMs) shows limited performance on this benchmark, highlighting both the challenges of the task and the importance of this benchmark for advancing medical AI. Furthermore, we propose a novel Clinical Graph of Thoughts model, which integrates domain-specific medical knowledge and clinical reasoning processes with the interpretive abilities of LVLMs. The model demonstrates promising results, achieving around a 15% absolute gain on the most important neurocognitive outcome task."
Researcher Affiliation: Collaboration
LLM response: "1 Boston Children's Hospital and Harvard Medical School, Boston, USA; 2 New York University; 3 MIT-IBM Watson AI Lab. Correspondence to: Zhenfang Chen <EMAIL>, Yangming Ou <EMAIL>."
Pseudocode: No
LLM response: The paper describes the 'Clinical Graph of Thought Model' and its reasoning flow, but it does not present this as a structured pseudocode block or algorithm.
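Since the paper provides no pseudocode, the following is a minimal, hypothetical sketch of what a graph-of-thought reasoning flow generally looks like: clinical sub-questions form a DAG, and each node's answer is conditioned on the answers of its prerequisite nodes. The node names, edge structure, and the `ask_lvlm` stub are all assumptions for illustration, not the paper's actual CGoT implementation.

```python
# Hypothetical graph-of-thought sketch: sub-tasks form a DAG and each
# node's prompt is conditioned on its predecessors' answers. This is NOT
# the paper's CGoT model; all names here are illustrative placeholders.

def ask_lvlm(question, context):
    # Placeholder for a real LVLM call (e.g., an API request with the
    # MRI, the question, and the accumulated context).
    return f"answer({question})"

def run_graph(nodes, edges, ask=ask_lvlm):
    """nodes: {name: question}; edges: {name: [prerequisite node names]}."""
    answers = {}

    def solve(name):
        if name in answers:          # memoize: each node answered once
            return answers[name]
        context = [solve(dep) for dep in edges.get(name, [])]
        answers[name] = ask(nodes[name], context)
        return answers[name]

    for name in nodes:
        solve(name)
    return answers

# Illustrative clinical sub-questions (assumed, not from the paper).
nodes = {
    "lesion": "Where are the MRI lesions?",
    "severity": "How severe is the injury pattern?",
    "outcome": "Predict the neurocognitive outcome.",
}
edges = {"severity": ["lesion"], "outcome": ["lesion", "severity"]}
answers = run_graph(nodes, edges)
```

The memoized depth-first traversal guarantees each sub-question is asked exactly once and always after its prerequisites, which is the core property any graph-of-thought pipeline needs.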
Open Source Code: Yes
LLM response: Project page: https://github.com/i3-research/HIE-Reasoning
Open Datasets: Yes
LLM response: HIE-Reasoning is the first publicly available HIE dataset that integrates MRIs, clinical information, and neurocognitive outcomes, and includes question-answer (QA) pairs along with comprehensive MRI interpretation summaries.
Dataset Splits: No
LLM response: The paper states the total number of individuals and QA pairs in the dataset (133 individuals, 749 QA pairs) but does not provide the specific training, validation, or test splits used in the experiments.
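Because no splits are published, anyone reproducing the evaluation must define their own. A minimal sketch is shown below: splitting at the patient level (not the QA-pair level) so that no individual's data leaks across splits. The 70/15/15 ratio and the seed are assumptions, not values from the paper; only the count of 133 individuals comes from the report above.

```python
# Hypothetical patient-level split for a dataset like HIE-Reasoning.
# Splitting by patient ID (not by QA pair) prevents the same individual's
# MRI from appearing in both train and test. Ratios/seed are assumed.
import random

def split_by_patient(patient_ids, ratios=(0.7, 0.15, 0.15), seed=0):
    """Return (train, val, test) lists of patient IDs, disjoint by patient."""
    ids = sorted(set(patient_ids))
    random.Random(seed).shuffle(ids)          # deterministic shuffle
    n_train = int(len(ids) * ratios[0])
    n_val = int(len(ids) * ratios[1])
    return (ids[:n_train],
            ids[n_train:n_train + n_val],
            ids[n_train + n_val:])

# 133 individuals, per the paper's reported dataset size.
train, val, test = split_by_patient(range(133))
print(len(train), len(val), len(test))  # 93 19 21
```

All 749 QA pairs would then be assigned to the split containing their patient, which is the standard way to avoid subject-level leakage in medical benchmarks.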
Hardware Specification: No
LLM response: The paper does not provide specific hardware details such as GPU models, CPU types, or cloud computing instance specifications used for running the experiments.
Software Dependencies: No
LLM response: The paper mentions several baseline LVLMs and the DRAMMS tool but does not specify version numbers for any software dependencies or libraries crucial for replication.
Experiment Setup: No
LLM response: For the evaluated LVLMs, the paper states: "All settings and hyperparameters are configured according to the specifications of the released versions." However, no specific hyperparameters (e.g., learning rate, batch size, number of epochs) are provided for the proposed CGoT model or the general experimental setup.