Can MLLMs Reason in Multimodality? EMMA: An Enhanced MultiModal ReAsoning Benchmark

Authors: Yunzhuo Hao, Jiawei Gu, Huichen Will Wang, Linjie Li, Zhengyuan Yang, Lijuan Wang, Yu Cheng

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our evaluation of state-of-the-art MLLMs on EMMA reveals significant limitations in handling complex multimodal, multi-step reasoning tasks; even advanced techniques such as Chain-of-Thought prompting and test-time compute scaling underperform.
Researcher Affiliation | Collaboration | 1University of Electronic Science and Technology of China, 2Sun Yat-sen University, 3University of Washington, 4Microsoft, 5The Chinese University of Hong Kong.
Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks. The methodology is described in narrative text.
Open Source Code | No | The project homepage can be accessed at https://emma-benchmark.github.io/. This homepage might contain code, but the paper makes no explicit statement of a code release and gives no direct link to a code repository for the methodology itself (benchmark creation and evaluation).
Open Datasets | Yes | We introduce EMMA (Enhanced MultiModal reAsoning), a benchmark targeting organic multimodal reasoning across mathematics, physics, chemistry, and coding. ... The project homepage can be accessed at https://emma-benchmark.github.io/.
Dataset Splits | Yes | To create a more balanced subset of EMMA, we randomly sample 400 questions (100 per subject) from the benchmark, hereafter referred to as EMMA-mini.
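The EMMA-mini construction described above (100 randomly sampled questions per subject, 400 total) can be sketched as a per-subject stratified sample. This is a minimal illustration, not the authors' released code; the `"subject"` field name and the fixed seed are assumptions for the example.

```python
import random

def sample_emma_mini(questions, per_subject=100, seed=0):
    """Randomly sample a fixed number of questions per subject.

    `questions` is assumed to be a list of dicts with a "subject" key;
    that field name is hypothetical, not taken from the EMMA release.
    """
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    by_subject = {}
    for q in questions:
        by_subject.setdefault(q["subject"], []).append(q)
    subset = []
    for subject in sorted(by_subject):  # sorted for deterministic order
        subset.extend(rng.sample(by_subject[subject], per_subject))
    return subset

# Toy example: 4 subjects x 150 questions -> 400-question subset
toy = [{"subject": s, "id": i}
       for s in ["math", "physics", "chemistry", "coding"]
       for i in range(150)]
mini = sample_emma_mini(toy)
print(len(mini))  # 400
```

With EMMA's four subjects, `per_subject=100` yields the 400-question EMMA-mini split.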
Hardware Specification | No | The paper does not explicitly mention any specific hardware (e.g., GPU models, CPU types, memory amounts) used for running the experiments.
Software Dependencies | Yes | We are using Python version 3.11.0, matplotlib version 3.6.3, and seaborn version 0.12.2 (if applicable).
Experiment Setup | Yes | For all models except o1, QVQ, and Gemini 2.0 Flash Thinking, we test two prompting strategies: (1) Direct prompting, which instructs models to output the answers without reasoning steps; and (2) Chain-of-Thought (CoT) prompting (Wei et al., 2022), where we prompt models to think step-by-step.
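The two prompting strategies above can be sketched as simple prompt templates. The exact wording used in the paper is not reproduced here; the templates and the `build_prompt` helper are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical templates for the two strategies described in the paper.
DIRECT_TEMPLATE = (
    "Answer the following question. Output only the final answer, "
    "without any reasoning steps.\n\n{question}"
)
COT_TEMPLATE = (
    "Answer the following question. Think step by step, then state "
    "your final answer.\n\n{question}"
)

def build_prompt(question: str, strategy: str = "cot") -> str:
    """Format a question under either 'direct' or 'cot' prompting."""
    if strategy == "cot":
        return COT_TEMPLATE.format(question=question)
    if strategy == "direct":
        return DIRECT_TEMPLATE.format(question=question)
    raise ValueError(f"unknown strategy: {strategy}")

print(build_prompt("What is 2 + 2?", strategy="direct"))
```

Under this setup, each model answers every question twice, once per strategy, so the two conditions can be compared on identical inputs.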