SURE-VQA: Systematic Understanding of Robustness Evaluation in Medical VQA Tasks
Authors: Kim-Celine Kahl, Selen Erkan, Jeremias Traub, Carsten T. Lüth, Klaus Maier-Hein, Lena Maier-Hein, Paul F. Jaeger
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To demonstrate the relevance of this framework, we conduct a study on the robustness of various Fine-Tuning (FT) methods across three medical datasets with four types of distribution shifts. Our study highlights key insights into robustness: 1) No FT method consistently outperforms others in robustness, and 2) robustness trends are more stable across FT methods than across distribution shifts. Additionally, we find that simple sanity baselines that do not use the image data can perform surprisingly well and confirm LoRA as the best-performing FT method on in-distribution data. |
| Researcher Affiliation | Academia | 1German Cancer Research Center (DKFZ) Heidelberg, Division of Medical Image Computing, Germany 2Helmholtz Imaging, German Cancer Research Center (DKFZ), Heidelberg, Germany 3Faculty of Mathematics and Computer Science, University of Heidelberg, Germany 4German Cancer Research Center (DKFZ) Heidelberg, Division of Intelligent Medical Systems, Germany 5German Cancer Research Center (DKFZ) Heidelberg, Interactive Machine Learning Group, Germany 6Pattern Analysis and Learning Group, Department of Radiation Oncology, Heidelberg University Hospital, Germany 7National Center for Tumor Diseases (NCT) Heidelberg, Germany |
| Pseudocode | No | The paper describes methods and processes in textual form and through figures but does not contain any explicitly labeled 'Pseudocode' or 'Algorithm' blocks, nor does it present structured steps in a code-like format. |
| Open Source Code | Yes | Code is provided at https://github.com/IML-DKFZ/sure-vqa. |
| Open Datasets | Yes | We utilize three datasets from the medical VQA domain, including SLAKE (Liu et al. (2021)), OVQA (Huang et al. (2022)), and MIMIC-CXR-VQA (Bae et al. (2023)). |
| Dataset Splits | Yes | We utilize the above-mentioned VQA datasets, splitting them so that the training and testing distributions differ. We employ two different base models: LLaVA-Med 1.5, a state-of-the-art medical VLM (Li et al. (2023)), and its corresponding non-medical base model, LLaVA 1.6 (Liu et al. (2023a)). For fine-tuning, we use the following methods: full FT, prompt tuning (Lester et al. (2021)), LoRA (Hu et al. (2021)), and (IA)³ (Liu et al. (2022)). Hyperparameters for the PEFT methods are selected based on the full training set and corresponding validation set for each dataset. Details regarding the hyperparameter search can be found in Appendix C.2. To measure robustness, we split the data into an i.i.d. training set and i.i.d. and OoD test sets, as outlined in section 3.1, thereby fulfilling R1. We then measure the performance of the VLMs using the language model Gemma, fulfilling R2. The selection of Gemma as an evaluator is further justified in section 3.3.2. For robustness measurement, we calculate the relative robustness (RR) (Chen et al. (2023)), defined as RR = 1 - ΔP/P_I, where ΔP = P_I - P_O, P_I is the i.i.d. test performance, and P_O is the OoD test performance. For a better interpretation of the results, we compare them against relevant sanity baselines as described in R3. |
| Hardware Specification | No | The paper does not explicitly mention specific hardware details such as GPU models, CPU types, or cloud computing instance specifications used for running the experiments. |
| Software Dependencies | No | The paper mentions software tools and models like 'Gemma model (Gemma Team et al. (2024))', 'LLaVA-Med 1.5 (Li et al. (2023))', 'LLaVA 1.6 (Liu et al. (2023a))', and 'OpenCV', but it does not specify explicit version numbers for these or other key software components required for replication. |
| Experiment Setup | Yes | Hyperparameters for the PEFT methods are selected based on the full training set and corresponding validation set for each dataset. Details regarding the hyperparameter search can be found in Appendix C.2. For the hyperparameter sweeps, we trained on the whole training set for each dataset and PEFT method and ran inference on the validation set. Training ran for 3 epochs with 3 seeds per experiment. |
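The relative-robustness (RR) metric quoted in the Dataset Splits row can be sketched as a small function. This is an illustrative sketch, not the authors' code; the function name is assumed, and the formula follows the definition RR = 1 - (P_I - P_O)/P_I from Chen et al. (2023) as cited in the paper.

```python
def relative_robustness(p_iid: float, p_ood: float) -> float:
    """Relative robustness as defined in the paper:
    RR = 1 - (P_I - P_O) / P_I,
    where P_I is i.i.d. test performance and P_O is OoD test performance.
    RR = 1 means no performance drop under distribution shift.
    """
    if p_iid == 0:
        raise ValueError("i.i.d. performance P_I must be non-zero")
    return 1 - (p_iid - p_ood) / p_iid


# Hypothetical example: accuracy drops from 0.80 (i.i.d.) to 0.60 (OoD).
print(relative_robustness(0.80, 0.60))  # 0.75
```

Note that RR simplifies to P_O / P_I, so it can exceed 1 if a model happens to perform better on the OoD split than on the i.i.d. split.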