MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models

Authors: Peng Xia, Siwei Han, Shi Qiu, Yiyang Zhou, Zhaoyang Wang, Wenhao Zheng, Zhaorun Chen, Chenhang Cui, Mingyu Ding, Linjie Li, Lijuan Wang, Huaxiu Yao

ICLR 2025

Reproducibility Variable | Result | LLM Response

Research Type: Experimental. "To address these challenges, we introduce MMIE, a large-scale knowledge-intensive benchmark for evaluating interleaved multimodal comprehension and generation in Large Vision-Language Models (LVLMs). MMIE comprises 20K meticulously curated multimodal queries, spanning 3 categories, 12 fields, and 102 subfields... Extensive experiments demonstrate the effectiveness of our benchmark and metrics in providing a comprehensive evaluation of interleaved LVLMs."

Researcher Affiliation: Collaboration. UNC-Chapel Hill, Microsoft Research, University of Chicago, NUS.

Pseudocode: No. The paper describes methods and processes in paragraph form and through diagrams (e.g., Figure 3: pipeline of the scoring model), but does not include any clearly labeled pseudocode blocks or algorithms.

Open Source Code: Yes. "We publicly release our benchmark and code in https://mmie-bench.github.io/."

Open Datasets: Yes. "To address these limitations, we introduce MMIE, a Massive Multimodal Interleaved understanding Evaluation benchmark for LVLMs with proposed reliable and automated metrics. MMIE is curated from four multimodal datasets... We publicly release our benchmark and code in https://mmie-bench.github.io/. ... In the first stage, we collect and restructure four multimodal datasets to align with the interleaved image-and-text format and categorize them into three categories: situational analysis, project-based learning, and multi-step reasoning... We extract data from WikiHow (Yang et al., 2021)... samples from VIST (Huang et al., 2016)... source examples from MathVista (Lu et al., 2024) and ReMI (Kazemi et al., 2024)."

Dataset Splits: Yes. "To validate its performance, we randomly select 200 new samples with human-scored labels and compare the results of our model with those of other scoring models. ... Finally, we create a dataset of 1K examples with evaluation scores through human annotation, with 800 examples used for fine-tuning the scoring model and 200 examples for evaluating the scoring model."

Hardware Specification: No. The paper does not specify the hardware (e.g., GPU models, CPU types) used to run the experiments or train the models.

Software Dependencies: No. The paper mentions the models and tools used (e.g., InternVL-2-4B, MiniGPT-5, EMU-2, GPT-4o, Stable Diffusion 3), often citing their respective papers, but does not list concrete software dependencies with version numbers (e.g., Python, PyTorch, or CUDA versions) required for reproduction.

Experiment Setup: No. The paper describes the baseline models, the integration of LVLMs with text-to-image models, and the evaluation metrics used. However, it does not provide specific experimental setup details such as hyperparameters (e.g., learning rates, batch sizes, number of epochs) for training or fine-tuning any of the models, including the proposed scoring model.
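The 800/200 annotation split noted under Dataset Splits can be sketched as a seeded shuffle-and-slice over the 1K human-annotated examples. This is a minimal illustration only: the function name, record fields, and seed below are assumptions, not taken from the paper's released code.

```python
import random

def split_annotations(examples, n_finetune=800, seed=0):
    """Shuffle human-annotated examples and split them into a
    fine-tuning set and a held-out evaluation set (800/200 here)."""
    rng = random.Random(seed)          # fixed seed for reproducibility
    shuffled = examples[:]             # copy so the input stays intact
    rng.shuffle(shuffled)
    return shuffled[:n_finetune], shuffled[n_finetune:]

# 1K human-scored examples, as described in the paper; the "id" and
# "score" fields are illustrative placeholders.
annotated = [{"id": i, "score": i % 6} for i in range(1000)]
train, heldout = split_annotations(annotated)
print(len(train), len(heldout))  # 800 200
```

A fixed seed keeps the fine-tuning/evaluation partition stable across runs, which matters when comparing scoring-model variants against the same 200 held-out human-scored samples.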