MIA-Bench: Towards Better Instruction Following Evaluation of Multimodal LLMs
Authors: Yusu Qian, Hanrong Ye, Jean-Philippe Fauconnier, Peter Grasch, Yinfei Yang, Zhe Gan
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We introduce MIA-Bench, a new benchmark designed to evaluate multimodal large language models (MLLMs) on their ability to strictly adhere to complex instructions. Our benchmark comprises a diverse set of 400 image-prompt pairs, each crafted to challenge the models' compliance with layered instructions in generating accurate responses that satisfy specific requested patterns. Evaluation results from a wide array of state-of-the-art MLLMs reveal significant variations in performance, highlighting areas for improvement in instruction fidelity. Additionally, we create extra training data and explore supervised fine-tuning to enhance the models' ability to strictly follow instructions without compromising performance on other tasks. |
| Researcher Affiliation | Collaboration | Yusu Qian (Apple), Hanrong Ye (Apple, HKUST), Jean-Philippe Fauconnier (Apple), Peter Grasch (Apple), Yinfei Yang (Apple), Zhe Gan (Apple) |
| Pseudocode | No | The paper describes the methodology for constructing MIA-Bench, instruction categories, and evaluation methods using text and figures, but does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Benchmark data and evaluation code: https://github.com/apple/ml-mia-bench. "For reproducibility purposes, we release our evaluation code and benchmark at: https://github.com/apple/ml-mia-bench." |
| Open Datasets | Yes | MIA-Bench consists of 400 image-prompt pairs... The images are collected from diverse sources, including the COCO 2017 validation set (Lin et al., 2015), SBU (Ordonez et al., 2011), TextVQA (Singh et al., 2019a), and Flickr. |
| Dataset Splits | Yes | First, we randomly sample 1000 images from the COCO 2017 training set, and use GPT-4V to generate five instructions for each image, using the prompt below. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running its experiments, such as GPU or CPU models. |
| Software Dependencies | No | The paper mentions various multimodal LLMs and models used for evaluation (e.g., GPT-4o, Claude-3) but does not provide specific version numbers for software dependencies or development environments used for the authors' own work. |
| Experiment Setup | Yes | Using LLaVA-NeXT-13b as the backbone, we train the model for 1 epoch on the constructed diverse instruction-tuning (DIT) data. |
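Since MIA-Bench scores each response against the layered sub-instructions of its prompt, a minimal sketch of the aggregation step may help illustrate the setup. This is a hypothetical simplification, assuming each sub-instruction receives a compliance score in [0, 1] and that item and benchmark scores are unweighted means; the paper's exact weighting scheme may differ.

```python
# Hypothetical MIA-Bench-style score aggregation (assumed unweighted means).

def item_score(sub_scores):
    """Mean compliance over the layered sub-instructions of one image-prompt pair."""
    if not sub_scores:
        raise ValueError("an item needs at least one sub-instruction score")
    return sum(sub_scores) / len(sub_scores)

def benchmark_score(items):
    """Mean item score over all image-prompt pairs in the benchmark."""
    return sum(item_score(s) for s in items) / len(items)

# Example: one fully compliant response and one partially compliant one.
overall = benchmark_score([[1.0, 1.0, 1.0], [1.0, 0.5, 0.0]])
print(round(overall, 2))  # 0.75
```

In practice the per-sub-instruction scores would come from an LLM judge (the paper uses a strong MLLM as evaluator); the aggregation itself is the only part sketched here.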