MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding
Authors: Fei Wang, Xingyu Fu, James Y. Huang, Zekun Li, Qin Liu, Xiaogeng Liu, Mingyu Derek Ma, Nan Xu, Wenxuan Zhou, Kai Zhang, Tianyi Yan, Wenjie Mo, Hsiang-Hui Liu, Pan Lu, Chunyuan Li, Chaowei Xiao, Kai-Wei Chang, Dan Roth, Sheng Zhang, Hoifung Poon, Muhao Chen
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We introduce MUIRBENCH, a comprehensive benchmark that focuses on robust multi-image understanding capabilities of multimodal LLMs. ... Evaluated upon 20 recent multi-modal LLMs, our results reveal that even the best-performing models like GPT-4o and Gemini Pro find it challenging to solve MUIRBENCH, achieving 68.0% and 49.3% in accuracy. |
| Researcher Affiliation | Collaboration | 1USC 2UPenn 3UMN 4UC Davis 5UW Madison 6UCLA 7OSU 8Microsoft Research |
| Pseudocode | No | The paper describes the methodology in prose, but there are no structured pseudocode blocks or algorithms explicitly labeled within the text. |
| Open Source Code | No | The paper states: 'Project page: https://muirbench.github.io/' and 'The evaluation code and outputs will be provided to facilitate easy reproduction and analyses of the results in the paper.' However, it does not provide a direct link to a source-code repository, nor does it explicitly state the code is available in supplementary materials or immediately accessible. |
| Open Datasets | Yes | We introduce MUIRBENCH, a comprehensive benchmark...MUIRBENCH is hosted on Huggingface/Datasets, where license and metadata are also available. We maintain our benchmark on this page and will continually update it. ... Existing data (40.8%) come from GeneCIS (Vaze et al., 2023), SeedBench (Li et al., 2023), and IconQA (Lu et al., 2021b). Derived data (21.7%) reformat data into MCQA format... upon instances from NLVR2 (Suhr et al., 2019), HallusionBench (Guan et al., 2023), ISVQA (Bansal et al., 2020), and MMBench (Liu et al., 2023c). New data (37.5%) address certain tasks... based on images from the National Geologic Map Database, University-1652 (Zheng et al., 2020; 2023), PubMed papers, and SciDuet slides (Sun et al., 2021). |
| Dataset Splits | Yes | MUIRBENCH consists of 11,264 images and 2,600 multiple-choice questions... Answerable Instances 1300... Unanswerable Instances 1300... This step doubles the size of data, leading to a balanced distribution of answerable and unanswerable instances. |
| Hardware Specification | No | The paper describes the experimental setup and lists the multimodal LLMs evaluated, but it does not provide specific hardware details such as GPU models, CPU types, or memory used for running the experiments. |
| Software Dependencies | Yes | We follow the standard setup as it is in VLMEvalKit (Contributors, 2023a)... We use a rule-based automatic tool to extract the exact answer. We refer the readers to Appendix D for more details... Footnote 8: https://github.com/MMMU-Benchmark/MMMU/blob/f3e473e1e7af2c65a56ab66d7b3cf09c5dbaf0b9/eval/utils/eval_utils.py#L10 |
| Experiment Setup | Yes | We follow the standard setup as it is in VLMEvalKit (Contributors, 2023a), where the temperature is set to 0 and retry is set to 10. For the models that do not support multiple images as input, we concatenate the images to constitute one input... Our prompt consists of four parts, the question, options, the hint indicating the answer format, and a prefix indicating the beginning of the answer. For images, we insert them into the text to form a coherent prompt. |
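The evaluation setup quoted above (a four-part MCQA prompt of question, options, format hint, and answer prefix, followed by rule-based answer extraction) can be sketched as follows. This is a minimal illustration, not the paper's released code: the function names, the hint wording, and the extraction regex are assumptions modeled on common MCQA evaluation tooling.

```python
import re

def build_prompt(question, options,
                 hint="Answer with the option's letter from the given choices directly."):
    """Assemble the four-part MCQA prompt described in the setup:
    question, lettered options, format hint, and an answer prefix."""
    letters = [chr(ord("A") + i) for i in range(len(options))]
    option_lines = "\n".join(f"{l}. {o}" for l, o in zip(letters, options))
    return f"{question}\n{option_lines}\n{hint}\nAnswer:"

def extract_answer(response, num_options=4):
    """Rule-based extraction of the chosen option letter from a model response.
    Returns the first isolated uppercase letter within the option range, or None."""
    last_letter = chr(ord("A") + num_options - 1)
    match = re.search(rf"\b([A-{last_letter}])\b", response)
    return match.group(1) if match else None
```

For example, a response like `"The answer is (B)."` would be reduced to the option letter `B` before being compared against the gold answer.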