MIA-Bench: Towards Better Instruction Following Evaluation of Multimodal LLMs
Authors: Yusu Qian, Hanrong Ye, Jean-Philippe Fauconnier, Peter Grasch, Yinfei Yang, Zhe Gan
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We introduce MIA-Bench, a new benchmark designed to evaluate multimodal large language models (MLLMs) on their ability to strictly adhere to complex instructions. Our benchmark comprises a diverse set of 400 image-prompt pairs, each crafted to challenge the models' compliance with layered instructions in generating accurate responses that satisfy specific requested patterns. Evaluation results from a wide array of state-of-the-art MLLMs reveal significant variations in performance, highlighting areas for improvement in instruction fidelity. Additionally, we create extra training data and explore supervised fine-tuning to enhance the models' ability to strictly follow instructions without compromising performance on other tasks. |
| Researcher Affiliation | Collaboration | Yusu Qian (Apple), Hanrong Ye (Apple, HKUST), Jean-Philippe Fauconnier (Apple), Peter Grasch (Apple), Yinfei Yang (Apple), Zhe Gan (Apple) |
| Pseudocode | No | The paper describes the methodology for constructing MIA-Bench, instruction categories, and evaluation methods using text and figures, but does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Benchmark data and evaluation code: https://github.com/apple/ml-mia-bench. "For reproducibility purposes, we release our evaluation code and benchmark at: https://github.com/apple/ml-mia-bench." |
| Open Datasets | Yes | MIA-Bench consists of 400 image-prompt pairs... The images are collected from diverse sources, including the COCO 2017 validation set (Lin et al., 2015), SBU (Ordonez et al., 2011), TextVQA (Singh et al., 2019a), and Flickr. |
| Dataset Splits | Yes | First, we randomly sample 1000 images from the COCO 2017 training set, and use GPT-4V to generate five instructions for each image, using the prompt below. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running its experiments, such as GPU or CPU models. |
| Software Dependencies | No | The paper mentions various multimodal LLMs and models used for evaluation (e.g., GPT-4o, Claude-3) but does not provide specific version numbers for software dependencies or development environments used for the authors' own work. |
| Experiment Setup | Yes | Using LLaVA-NeXT-13b as the backbone, we train the model for 1 epoch on the constructed diverse instruction-tuning (DIT) data. |
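Since MIA-Bench scores each response against the layered sub-instructions of its prompt, a minimal sketch of the aggregation step may help illustrate the setup. This is a hypothetical simplification, assuming each sub-instruction receives a compliance score in [0, 1] and that item and benchmark scores are unweighted means; the paper's exact weighting scheme may differ.

```python
# Hypothetical MIA-Bench-style score aggregation (assumed unweighted means).

def item_score(sub_scores):
    """Mean compliance over the layered sub-instructions of one image-prompt pair."""
    if not sub_scores:
        raise ValueError("an item needs at least one sub-instruction score")
    return sum(sub_scores) / len(sub_scores)

def benchmark_score(items):
    """Mean item score over all image-prompt pairs in the benchmark."""
    return sum(item_score(s) for s in items) / len(items)

# Example: one fully compliant response and one partially compliant one.
overall = benchmark_score([[1.0, 1.0, 1.0], [1.0, 0.5, 0.0]])
print(round(overall, 2))  # 0.75
```

In practice the per-sub-instruction scores would come from an LLM judge (the paper uses a strong MLLM as evaluator); the aggregation itself is the only part sketched here.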