SURE-VQA: Systematic Understanding of Robustness Evaluation in Medical VQA Tasks
Authors: Kim-Celine Kahl, Selen Erkan, Jeremias Traub, Carsten T. Lüth, Klaus Maier-Hein, Lena Maier-Hein, Paul F. Jaeger
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To demonstrate the relevance of this framework, we conduct a study on the robustness of various Fine-Tuning (FT) methods across three medical datasets with four types of distribution shifts. Our study highlights key insights into robustness: 1) No FT method consistently outperforms others in robustness, and 2) robustness trends are more stable across FT methods than across distribution shifts. Additionally, we find that simple sanity baselines that do not use the image data can perform surprisingly well and confirm LoRA as the best-performing FT method on in-distribution data. |
| Researcher Affiliation | Academia | 1German Cancer Research Center (DKFZ) Heidelberg, Division of Medical Image Computing, Germany 2Helmholtz Imaging, German Cancer Research Center (DKFZ), Heidelberg, Germany 3Faculty of Mathematics and Computer Science, University of Heidelberg, Germany 4German Cancer Research Center (DKFZ) Heidelberg, Division of Intelligent Medical Systems, Germany 5German Cancer Research Center (DKFZ) Heidelberg, Interactive Machine Learning Group, Germany 6Pattern Analysis and Learning Group, Department of Radiation Oncology, Heidelberg University Hospital, Germany 7National Center for Tumor Diseases (NCT) Heidelberg, Germany |
| Pseudocode | No | The paper describes methods and processes in textual form and through figures but does not contain any explicitly labeled 'Pseudocode' or 'Algorithm' blocks, nor does it present structured steps in a code-like format. |
| Open Source Code | Yes | Code is provided at https://github.com/IML-DKFZ/sure-vqa. |
| Open Datasets | Yes | We utilize three datasets from the medical VQA domain, including SLAKE (Liu et al. (2021)), OVQA (Huang et al. (2022)), and MIMIC-CXR-VQA (Bae et al. (2023)). |
| Dataset Splits | Yes | We utilize the above-mentioned VQA datasets, splitting them so that the training and testing distributions differ. We employ two different base models: LLaVA-Med 1.5, a state-of-the-art medical VLM (Li et al. (2023)), and its corresponding non-medical base model, LLaVA 1.6 (Liu et al. (2023a)). For fine-tuning, we use the following methods: full FT, prompt tuning (Lester et al. (2021)), LoRA (Hu et al. (2021)), and (IA)³ (Liu et al. (2022)). Hyperparameters for the PEFT methods are selected based on the full training set and corresponding validation set for each dataset. Details regarding the hyperparameter search can be found in Appendix C.2. To measure robustness, we split the data into an i.i.d. training set and i.i.d. and OoD test sets, as outlined in section 3.1, thereby fulfilling R1. We then measure the performance of the VLMs using the language model Gemma, fulfilling R2. The selection of Gemma as an evaluator is further justified in section 3.3.2. For robustness measurement, we calculate the relative robustness (RR) (Chen et al. (2023)), defined as RR = 1 - ΔP/P_I, where ΔP = P_I - P_O, P_I is the i.i.d. test performance, and P_O is the OoD test performance. For a better interpretation of the results, we compare them against relevant sanity baselines as described in R3. |
| Hardware Specification | No | The paper does not explicitly mention specific hardware details such as GPU models, CPU types, or cloud computing instance specifications used for running the experiments. |
| Software Dependencies | No | The paper mentions software tools and models like 'Gemma model (Gemma Team et al. (2024))', 'LLaVA-Med 1.5 (Li et al. (2023))', 'LLaVA 1.6 (Liu et al. (2023a))', and 'OpenCV', but it does not specify explicit version numbers for these or other key software components required for replication. |
| Experiment Setup | Yes | Hyperparameters for the PEFT methods are selected based on the full training set and corresponding validation set for each dataset. Details regarding the hyperparameter search can be found in Appendix C.2. For the hyperparameter sweeps, we trained on the whole training set for each dataset and PEFT method and ran inference on the validation set. Training ran for 3 epochs with 3 seeds per experiment. |
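The relative-robustness (RR) metric quoted in the Dataset Splits row can be sketched as a small function. This is an illustrative sketch, not the authors' code; the function name is assumed, and the formula follows the definition RR = 1 - (P_I - P_O)/P_I from Chen et al. (2023) as cited in the paper.

```python
def relative_robustness(p_iid: float, p_ood: float) -> float:
    """Relative robustness as defined in the paper:
    RR = 1 - (P_I - P_O) / P_I,
    where P_I is i.i.d. test performance and P_O is OoD test performance.
    RR = 1 means no performance drop under distribution shift.
    """
    if p_iid == 0:
        raise ValueError("i.i.d. performance P_I must be non-zero")
    return 1 - (p_iid - p_ood) / p_iid


# Hypothetical example: accuracy drops from 0.80 (i.i.d.) to 0.60 (OoD).
print(relative_robustness(0.80, 0.60))  # 0.75
```

Note that RR simplifies to P_O / P_I, so it can exceed 1 if a model happens to perform better on the OoD split than on the i.i.d. split.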