MMedPO: Aligning Medical Vision-Language Models with Clinical-Aware Multimodal Preference Optimization

Authors: Kangyu Zhu, Peng Xia, Yun Li, Hongtu Zhu, Sheng Wang, Huaxiu Yao

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments demonstrate that MMedPO significantly enhances factual accuracy, achieving improvements over existing baseline methods by an average of 14.2% and 51.7% across the Med-VQA and report generation tasks. Our code is available at https://github.com/aiminglab/MMedPO. ... 4. Experiment In this section, we evaluate the effectiveness of MMedPO to answer the following questions: (1) Can MMedPO enhance the factual accuracy of Med-LVLMs compared to other alignment baselines? (2) How does each individual component of the framework contribute to overall performance? (3) Is MMedPO compatible with different Med-LVLM architectures? (4) Does MMedPO improve Med-LVLM responses in terms of clinical relevance? 4.1. Experimental Setups: Evaluation Datasets. Implementation Details. Baselines. Evaluation Metrics. 4.2. Main Results: Comparison with Baseline Methods. Comparison with Baseline Methods Enhanced by SFT. 4.3. Quantitative Analysis: Ablation Study. Multiple vs. Single Med-LLM. Impact of Localized Lesion Noise. Compatibility Analysis. 4.4. Qualitative Analysis and Case Study.
Researcher Affiliation | Academia | 1 UNC-Chapel Hill, 2 Brown University, 3 University of Washington. Correspondence to: Kangyu Zhu <EMAIL>, Peng Xia <EMAIL>, Huaxiu Yao <EMAIL>.
Pseudocode | Yes | Algorithm 1: Multimodal Medical Preference Optimization (MMedPO)
Input: D = {x_v^(i), x_t^(i), y^(i)}_{i=1}^N: Dataset; M(·,·): Med-LVLM; T(·): Visual Tool; G(·): Med-LLM; N(·,·): Localized Noisy Process; Z(·): Normalization.
Output: πθ: Parameters of the Med-LVLM.
1  Initialize Do with an empty set
2  foreach (x_v, x_t, y) ∈ D do
3    // Preference Data Curation
4    Generate responses of the Med-LVLM: a ← M(x_v, x_t)
5    Select the dispreferred response: y_l ← GPT(a, y)
6    // Quantify the Clinical Relevance
7    Quantify the clinical relevance using Med-LLMs: s_t ← G(y_l)
8    Put {x_v, y, y_l, s_t} into Do
9    Obtain the heatmap of the lesion region: h ← T(x_v)
10   Save the confidence score from the visual tool: s_v ← P(h | x_v)
11   Add noise to the localized region: x'_v ← N(x_v, h)
12   Put {x_v, x'_v, y, s_v} into Do
13 // Clinical Preference Optimization
14 foreach (x, x', y_w, y_l, s) ∈ Do do
15   Normalize the score: s ← Z(s)
16   Update πθ through Eq. (3)
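The curation loop in Algorithm 1 can be sketched in Python. All model and tool calls below (med_lvlm, judge_dispreferred, med_llm_score, lesion_heatmap, add_localized_noise) are hypothetical stubs standing in for M, GPT, G, T, and N; they are not the authors' implementation, which would wrap the actual Med-LVLM, GPT-4o judge, Med-LLM scorers, and visual tool.

```python
# Minimal sketch of Algorithm 1's preference-data curation loop.
# Every helper here is a hypothetical stand-in, not the paper's code.
import random

def med_lvlm(image, question):                 # M(x_v, x_t): generate a response
    return f"generated answer to: {question}"

def judge_dispreferred(answer, reference):     # GPT(a, y): pick the dispreferred response
    return answer if answer != reference else reference + " (perturbed)"

def med_llm_score(response):                   # G(y_l): clinical-relevance score s_t
    return random.uniform(0.0, 1.0)

def lesion_heatmap(image):                     # T(x_v): lesion heatmap h + confidence s_v
    return {"region": "lower-left lobe"}, random.uniform(0.0, 1.0)

def add_localized_noise(image, heatmap):       # N(x_v, h): corrupt only the lesion region
    return image + f" [noise on {heatmap['region']}]"

def curate_preference_data(dataset):
    d_o = []
    for image, question, reference in dataset:
        # Text-side pair: dispreferred response, scored by Med-LLMs
        answer = med_lvlm(image, question)
        y_l = judge_dispreferred(answer, reference)
        s_t = med_llm_score(y_l)
        d_o.append({"x_v": image, "y_w": reference, "y_l": y_l, "score": s_t})
        # Visual-side pair: locally-noised image as the dispreferred input
        heatmap, s_v = lesion_heatmap(image)
        x_v_noisy = add_localized_noise(image, heatmap)
        d_o.append({"x_v": image, "x_v'": x_v_noisy, "y_w": reference, "score": s_v})
    return d_o

pairs = curate_preference_data([("xray_001", "Is there an effusion?", "No effusion.")])
print(len(pairs))  # two preference tuples per sample: one text-side, one visual-side
```

Each training sample thus yields one text-side and one visual-side preference tuple, each carrying a clinical-relevance score that is later normalized before the Eq. (3) update.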
Open Source Code | Yes | Our code is available at https://github.com/aiminglab/MMedPO.
Open Datasets | Yes | Evaluation Datasets. To verify the effectiveness of MMedPO in improving factuality, we utilize four medical datasets: two medical VQA datasets, i.e., VQA-RAD (Lau et al., 2018) and SLAKE (Liu et al., 2021), and two report generation datasets, i.e., MIMIC-CXR (Johnson et al., 2020) and IU-Xray (Demner-Fushman et al., 2016). ... Appendix A.2. Involved Datasets: We leverage four open-source medical vision-language datasets: MIMIC-CXR (Johnson et al., 2020), IU-Xray (Demner-Fushman et al., 2016), SLAKE (Liu et al., 2021), and VQA-RAD (Lau et al., 2018).
Dataset Splits | Yes | Evaluation Datasets. To verify the effectiveness of MMedPO in improving factuality, we utilize four medical datasets: two medical VQA datasets, i.e., VQA-RAD (Lau et al., 2018) and SLAKE (Liu et al., 2021), and two report generation datasets, i.e., MIMIC-CXR (Johnson et al., 2020) and IU-Xray (Demner-Fushman et al., 2016). ... Appendix A.1. Data Statistics: The data statistics are shown in Table 5 and Table 6. In the training datasets, the reported quantities for the two report generation datasets represent image-report pairs, while the quantities for the two medical VQA datasets represent question-answer pairs. Table 5. Data statistics for the training set of four datasets under two different task settings. Train (visual) refers to the number of visual-only preference data, while Train (text) indicates the number of text-only preference data. [Table 5 contents] Table 6. Data statistics of the test set. #Images, #QA items, and #Reports denote the number of images, QA pairs, and reports, respectively. [Table 6 contents]
Hardware Specification | Yes | All experiments are implemented using PyTorch 2.1.2 on four NVIDIA RTX A6000 GPUs, with training requiring approximately 2 to 3 hours.
Software Dependencies | Yes | All experiments are implemented using PyTorch 2.1.2 on four NVIDIA RTX A6000 GPUs, with training requiring approximately 2 to 3 hours.
Experiment Setup | Yes | Implementation Details. We utilize LLaVA-Med-1.5 7B (Li et al., 2023a) as the base model. During the preference optimization stage, we apply LoRA fine-tuning (Hu et al., 2021) with a batch size of 4, a learning rate of 1e-7, and train for 3 epochs. For curating preference data, we use GPT-4o to evaluate and generate dispreferred responses. In the multi-agent collaboration system, multiple Med-LLMs, including LLaMA3-Med42-7B (Christophe et al., 2024), LLaMA3-Med42-70B, and BioMistral-7B (Labrak et al., 2024), are used to evaluate the relevance scores for the preference data. See Appendix B for more details. ... Appendix B. Hyperparameter Settings: For the usage of visual tools, we employ the disease as the text description to guide MedKLIP (Wu et al., 2023b) in generating heatmaps. For multi-agent collaboration, the process is conducted over 5 rounds. During score normalization, the parameters are set as: α = 0.75, β = 1.25, µ = 1, and σ² = 0.1. All hyperparameters are kept consistent across the experiments to eliminate any potential bias introduced by hyperparameter tuning.
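One way to read the normalization hyperparameters (α = 0.75, β = 1.25, µ = 1, σ² = 0.1) is as a standardize-rescale-clip scheme for Z(·). The exact form of Z is not given in this excerpt, so the scheme below, along with the function name normalize_scores, is an assumption for illustration rather than the paper's definition.

```python
# Hedged sketch of a score normalization Z(.) consistent with the reported
# hyperparameters. ASSUMED form: standardize the batch of clinical-relevance
# scores, rescale to mean MU and variance SIGMA2, then clip into [ALPHA, BETA].
import math

ALPHA, BETA, MU, SIGMA2 = 0.75, 1.25, 1.0, 0.1

def normalize_scores(scores):
    n = len(scores)
    mean = sum(scores) / n
    var = sum((s - mean) ** 2 for s in scores) / n
    std = math.sqrt(var) if var > 0 else 1.0
    out = []
    for s in scores:
        z = (s - mean) / std                  # standardize to zero mean, unit variance
        z = MU + z * math.sqrt(SIGMA2)        # rescale to mean MU, variance SIGMA2
        out.append(min(max(z, ALPHA), BETA))  # clip into [ALPHA, BETA]
    return out

print(normalize_scores([0.2, 0.5, 0.9]))
```

Centering the weights around µ = 1 and bounding them in [0.75, 1.25] would keep the scores acting as mild per-sample multipliers on the preference loss, so no single Med-LLM judgment can dominate the update.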