MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models

Authors: Peng Xia, Siwei Han, Shi Qiu, Yiyang Zhou, Zhaoyang Wang, Wenhao Zheng, Zhaorun Chen, Chenhang Cui, Mingyu Ding, Linjie Li, Lijuan Wang, Huaxiu Yao

ICLR 2025

Reproducibility Variable | Result | LLM Response

Research Type: Experimental. "To address these challenges, we introduce MMIE, a large-scale knowledge-intensive benchmark for evaluating interleaved multimodal comprehension and generation in Large Vision-Language Models (LVLMs). MMIE comprises 20K meticulously curated multimodal queries, spanning 3 categories, 12 fields, and 102 subfields... Extensive experiments demonstrate the effectiveness of our benchmark and metrics in providing a comprehensive evaluation of interleaved LVLMs."

Researcher Affiliation: Collaboration. UNC-Chapel Hill, Microsoft Research, University of Chicago, NUS.

Pseudocode: No. The paper describes methods and processes in paragraph form and through diagrams (e.g., Figure 3: pipeline of the scoring model), but does not include any clearly labeled pseudocode blocks or algorithms.

Open Source Code: Yes. "We publicly release our benchmark and code in https://mmie-bench.github.io/."

Open Datasets: Yes. "To address these limitations, we introduce MMIE, a Massive Multimodal Interleaved understanding Evaluation benchmark for LVLMs with proposed reliable and automated metrics. MMIE is curated from four multimodal datasets... We publicly release our benchmark and code in https://mmie-bench.github.io/. ... In the first stage, we collect and restructure four multimodal datasets to align with the interleaved image-and-text format and categorize them into three categories: situational analysis, project-based learning, and multi-step reasoning... We extract data from WikiHow (Yang et al., 2021)... samples from VIST (Huang et al., 2016)... source examples from MathVista (Lu et al., 2024) and ReMI (Kazemi et al., 2024)."

Dataset Splits: Yes. "To validate its performance, we randomly select 200 new samples with human-scored labels and compare the results of our model with those of other scoring models. ... Finally, we create a dataset of 1K examples with evaluation scores through human annotation, with 800 examples used for fine-tuning the scoring model and 200 examples for evaluating the scoring model."

Hardware Specification: No. The paper does not specify the hardware (e.g., GPU models, CPU types) used to run the experiments or train the models.

Software Dependencies: No. The paper mentions the models and tools used (e.g., InternVL-2-4B, MiniGPT-5, EMU-2, GPT-4o, Stable Diffusion 3), often citing their respective papers, but does not list concrete software dependencies with version numbers (e.g., Python, PyTorch, or CUDA versions) required for reproduction.

Experiment Setup: No. The paper describes the baseline models, the integration of LVLMs with text-to-image models, and the evaluation metrics used. However, it does not provide specific experimental setup details such as hyperparameters (e.g., learning rates, batch sizes, number of epochs) for training or fine-tuning any of the models, including the proposed scoring model.
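The 800/200 annotation split noted under Dataset Splits can be sketched as a seeded shuffle-and-slice over the 1K human-annotated examples. This is a minimal illustration only: the function name, record fields, and seed below are assumptions, not taken from the paper's released code.

```python
import random

def split_annotations(examples, n_finetune=800, seed=0):
    """Shuffle human-annotated examples and split them into a
    fine-tuning set and a held-out evaluation set (800/200 here)."""
    rng = random.Random(seed)          # fixed seed for reproducibility
    shuffled = examples[:]             # copy so the input stays intact
    rng.shuffle(shuffled)
    return shuffled[:n_finetune], shuffled[n_finetune:]

# 1K human-scored examples, as described in the paper; the "id" and
# "score" fields are illustrative placeholders.
annotated = [{"id": i, "score": i % 6} for i in range(1000)]
train, heldout = split_annotations(annotated)
print(len(train), len(heldout))  # 800 200
```

A fixed seed keeps the fine-tuning/evaluation partition stable across runs, which matters when comparing scoring-model variants against the same 200 held-out human-scored samples.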