MIA-DPO: Multi-Image Augmented Direct Preference Optimization For Large Vision-Language Models

Authors: Ziyu Liu, Yuhang Zang, Xiaoyi Dong, Pan Zhang, Yuhang Cao, Haodong Duan, Conghui He, Yuanjun Xiong, Dahua Lin, Jiaqi Wang

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments (Fig. 1(b)) demonstrate that MIA-DPO is agnostic to different LVLM architectures (LLaVA-v1.5 (Liu et al., 2024a) and InternLM-XC2.5 (Zhang et al., 2024)), boosts the performance on multiple multi-image benchmarks while maintaining the original single-image understanding capabilities. [...] We evaluate our method on the following representative benchmarks. First, we select five multi-image benchmarks: MMMU (Yue et al., 2024), BLINK (Fu et al., 2024), Mantis (Jiang et al., 2024), NLVR2 (Suhr et al., 2018), and MVBench (Li et al., 2024c). [...] We also test the model on several single-image benchmarks: MMStar (Chen et al., 2024a), ScienceQA (Lu et al., 2022), MMVet (Yu et al., 2023), POPE (Li et al., 2023c), MMBench (Liu et al., 2023), MathVista (Lu et al., 2023), AI2D (Kembhavi et al., 2016), and OCRBench (Liu et al., 2024c).
Researcher Affiliation | Collaboration | Ziyu Liu1,2, Yuhang Zang2B, Xiaoyi Dong2, Pan Zhang2, Yuhang Cao2, Haodong Duan2, Conghui He2, Yuanjun Xiong4, Dahua Lin2,3,6, Jiaqi Wang2,5B 1 SJTU, 2 Shanghai AI Laboratory, 3 CUHK, 4 MThreads, Inc, 5 Shanghai Innovation Institute, 6 CPII under InnoHK EMAIL, EMAIL
Pseudocode | No | The paper describes the MIA-DPO framework through textual explanations, diagrams (Figures 1, 3, 4), and mathematical formulations (Equations 1, 2, 3, 5). However, it does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | GitHub: https://github.com/Liuziyu77/MIA-DPO
Open Datasets | Yes | we efficiently convert existing single-image datasets, such as LLaVA-665k (Liu et al., 2024a). [...] We evaluate our method on the following representative benchmarks. First, we select five multi-image benchmarks: MMMU (Yue et al., 2024), BLINK (Fu et al., 2024), Mantis (Jiang et al., 2024), NLVR2 (Suhr et al., 2018), and MVBench (Li et al., 2024c). [...] Subsequently, we also test the model on several single-image benchmarks: MMStar (Chen et al., 2024a), ScienceQA (Lu et al., 2022), MMVet (Yu et al., 2023), POPE (Li et al., 2023c), MMBench (Liu et al., 2023), MathVista (Lu et al., 2023), AI2D (Kembhavi et al., 2016), and OCRBench (Liu et al., 2024c).
Dataset Splits | No | In constructing our MIA-DPO dataset with three types of multi-image data (Sequence Data, Grid Collage Data, and Pic-in-Pic Data), we used the LLaVA-665k (Liu et al., 2024b) dataset as the foundational single-image data. [...] The final data volume used for DPO is summarized in Tab. 8. [...] We constructed a VQA test set of 500 questions using images and questions from LLaVA-665k but are mutually exclusive with the MIA-DPO training data.
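The three multi-image formats quoted above (Sequence, Grid Collage, Pic-in-Pic) are all assembled from single-image samples. As a rough illustration of the grid-collage case only, the cell layout might be computed as follows; the function name, canvas size, and tiling rule are hypothetical and not taken from the paper:

```python
import math

def grid_collage_boxes(num_images, canvas_w=672, canvas_h=672):
    """Compute (left, top, right, bottom) boxes tiling a canvas into a
    near-square grid, one cell per input image. Hypothetical helper
    illustrating grid-collage construction; sizes and layout rules are
    assumptions, not values from the paper."""
    cols = math.ceil(math.sqrt(num_images))
    rows = math.ceil(num_images / cols)
    cell_w, cell_h = canvas_w // cols, canvas_h // rows
    boxes = []
    for i in range(num_images):
        r, c = divmod(i, cols)
        boxes.append((c * cell_w, r * cell_h,
                      (c + 1) * cell_w, (r + 1) * cell_h))
    return boxes
```

Each single image would then be resized and pasted into its box, with the paired question referring to images by grid position.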
Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or cloud computing instance specifications used for running the experiments. It mentions 'All single-image experimental results presented in Tab 2 are obtained using the VLMEvalKit (Duan et al., 2024)' but does not detail the hardware setup for these evaluations or for the main training.
Software Dependencies | No | The paper mentions using 'VLMEvalKit' for evaluations and refers to DPO algorithms and models like LLaVA-v1.5 and InternLM-XC2.5, but it does not specify any software versions for programming languages, libraries, or frameworks (e.g., Python, PyTorch, CUDA versions).
Experiment Setup Yes The models are trained on 3 epochs, with a learning rate of 5e 5, temperature parameter (in Eq. 3) β = 0.1, and NLL loss coefficient (in Eq. 5) γ = 0.1. For more experimental details, please refer to appendix Sec. A.
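The reported hyperparameters plug into a standard DPO objective with an auxiliary NLL term. A minimal plain-Python sketch on per-response log-probabilities is given below; this is the generic DPO formulation with the paper's β = 0.1 and γ = 0.1, and the exact way the NLL term is combined is an assumption, not the authors' implementation:

```python
import math

def mia_dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l,
                 beta=0.1, gamma=0.1):
    """Sketch of a DPO loss plus an NLL regularizer on the chosen response.
    logp_w / logp_l: summed log-probs of the chosen (w) and rejected (l)
    responses under the policy; ref_logp_*: same under the frozen
    reference model. beta and gamma follow the values reported in the
    paper; the loss composition itself is an assumption."""
    # Implicit reward margin between chosen and rejected responses.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # DPO term: -log sigmoid(margin).
    dpo = -math.log(1.0 / (1.0 + math.exp(-margin)))
    # NLL term: negative log-likelihood of the chosen response.
    nll = -logp_w
    return dpo + gamma * nll
```

With equal policy and reference log-probs the margin is zero and the DPO term reduces to -log(0.5) ≈ 0.693; a larger margin in favor of the chosen response drives the loss down, while the γ-weighted NLL term keeps the policy anchored to the chosen completions.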