MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding
Authors: Fei Wang, Xingyu Fu, James Y. Huang, Zekun Li, Qin Liu, Xiaogeng Liu, Mingyu Derek Ma, Nan Xu, Wenxuan Zhou, Kai Zhang, Tianyi Yan, Wenjie Mo, Hsiang-Hui Liu, Pan Lu, Chunyuan Li, Chaowei Xiao, Kai-Wei Chang, Dan Roth, Sheng Zhang, Hoifung Poon, Muhao Chen
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We introduce MUIRBENCH, a comprehensive benchmark that focuses on robust multi-image understanding capabilities of multimodal LLMs. ... Evaluated upon 20 recent multi-modal LLMs, our results reveal that even the best-performing models like GPT-4o and Gemini Pro find it challenging to solve MUIRBENCH, achieving 68.0% and 49.3% in accuracy. |
| Researcher Affiliation | Collaboration | 1USC 2UPenn 3UMN 4UC Davis 5UW Madison 6UCLA 7OSU 8Microsoft Research |
| Pseudocode | No | The paper describes the methodology in prose, but there are no structured pseudocode blocks or algorithms explicitly labeled within the text. |
| Open Source Code | No | The paper states: 'Project page: https://muirbench.github.io/' and 'The evaluation code and outputs will be provided to facilitate easy reproduction and analyses of the results in the paper.' However, it does not provide a direct link to a source-code repository, nor does it explicitly state the code is available in supplementary materials or immediately accessible. |
| Open Datasets | Yes | We introduce MUIRBENCH, a comprehensive benchmark...MUIRBENCH is hosted on Huggingface/Datasets, where license and metadata are also available. We maintain our benchmark on this page and will continually update it. ... Existing data (40.8%) come from GeneCIS (Vaze et al., 2023), SeedBench (Li et al., 2023), and IconQA (Lu et al., 2021b). Derived data (21.7%) reformat data into MCQA format... upon instances from NLVR2 (Suhr et al., 2019), HallusionBench (Guan et al., 2023), ISVQA (Bansal et al., 2020), and MMBench (Liu et al., 2023c). New data (37.5%) address certain tasks... based on images from the National Geologic Map Database, University-1652 (Zheng et al., 2020; 2023), PubMed papers, and SciDuet slides (Sun et al., 2021). |
| Dataset Splits | Yes | MUIRBENCH consists of 11,264 images and 2,600 multiple-choice questions... Answerable Instances 1300... Unanswerable Instances 1300... This step doubles the size of data, leading to a balanced distribution of answerable and unanswerable instances. |
| Hardware Specification | No | The paper describes the experimental setup and lists the multimodal LLMs evaluated, but it does not provide specific hardware details such as GPU models, CPU types, or memory used for running the experiments. |
| Software Dependencies | Yes | We follow the standard setup as it is in VLMEvalKit (Contributors, 2023a)... We use a rule-based automatic tool to extract the exact answer. We refer the readers to Appendix D for more details... Footnote 8: https://github.com/MMMU-Benchmark/MMMU/blob/f3e473e1e7af2c65a56ab66d7b3cf09c5dbaf0b9/eval/utils/eval_utils.py#L10 |
| Experiment Setup | Yes | We follow the standard setup as it is in VLMEvalKit (Contributors, 2023a), where the temperature is set to 0 and retry is set to 10. For the models that do not support multiple images as input, we concatenate the images to constitute one input... Our prompt consists of four parts, the question, options, the hint indicating the answer format, and a prefix indicating the beginning of the answer. For images, we insert them into the text to form a coherent prompt. |
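The evaluation setup quoted above (a four-part MCQA prompt of question, options, format hint, and answer prefix, followed by rule-based answer extraction) can be sketched as follows. This is a minimal illustration, not the paper's released code: the function names, the hint wording, and the extraction regex are assumptions modeled on common MCQA evaluation tooling.

```python
import re

def build_prompt(question, options,
                 hint="Answer with the option's letter from the given choices directly."):
    """Assemble the four-part MCQA prompt described in the setup:
    question, lettered options, format hint, and an answer prefix."""
    letters = [chr(ord("A") + i) for i in range(len(options))]
    option_lines = "\n".join(f"{l}. {o}" for l, o in zip(letters, options))
    return f"{question}\n{option_lines}\n{hint}\nAnswer:"

def extract_answer(response, num_options=4):
    """Rule-based extraction of the chosen option letter from a model response.
    Returns the first isolated uppercase letter within the option range, or None."""
    last_letter = chr(ord("A") + num_options - 1)
    match = re.search(rf"\b([A-{last_letter}])\b", response)
    return match.group(1) if match else None
```

For example, a response like `"The answer is (B)."` would be reduced to the option letter `B` before being compared against the gold answer.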