Can MLLMs Reason in Multimodality? EMMA: An Enhanced MultiModal ReAsoning Benchmark

Authors: Yunzhuo Hao, Jiawei Gu, Huichen Will Wang, Linjie Li, Zhengyuan Yang, Lijuan Wang, Yu Cheng

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our evaluation of state-of-the-art MLLMs on EMMA reveals significant limitations in handling complex multimodal, multi-step reasoning tasks; even advanced techniques such as Chain-of-Thought prompting and test-time compute scaling underperform.
Researcher Affiliation | Collaboration | 1University of Electronic Science and Technology of China, 2Sun Yat-sen University, 3University of Washington, 4Microsoft, 5The Chinese University of Hong Kong.
Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks. The methodology is described in narrative text.
Open Source Code | No | The project homepage can be accessed at https://emma-benchmark.github.io/. This homepage might contain code, but the paper makes no explicit statement of a code release and gives no direct link to a code repository for the methodology itself (benchmark creation and evaluation).
Open Datasets | Yes | We introduce EMMA (Enhanced MultiModal reAsoning), a benchmark targeting organic multimodal reasoning across mathematics, physics, chemistry, and coding. ... The project homepage can be accessed at https://emma-benchmark.github.io/.
Dataset Splits | Yes | To create a more balanced subset of EMMA, we randomly sample 400 questions (100 per subject) from the benchmark, hereafter referred to as EMMA-mini.
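The EMMA-mini construction described above (100 randomly sampled questions per subject, 400 total) can be sketched as a per-subject stratified sample. This is a minimal illustration, not the authors' released code; the `"subject"` field name and the fixed seed are assumptions for the example.

```python
import random

def sample_emma_mini(questions, per_subject=100, seed=0):
    """Randomly sample a fixed number of questions per subject.

    `questions` is assumed to be a list of dicts with a "subject" key;
    that field name is hypothetical, not taken from the EMMA release.
    """
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    by_subject = {}
    for q in questions:
        by_subject.setdefault(q["subject"], []).append(q)
    subset = []
    for subject in sorted(by_subject):  # sorted for deterministic order
        subset.extend(rng.sample(by_subject[subject], per_subject))
    return subset

# Toy example: 4 subjects x 150 questions -> 400-question subset
toy = [{"subject": s, "id": i}
       for s in ["math", "physics", "chemistry", "coding"]
       for i in range(150)]
mini = sample_emma_mini(toy)
print(len(mini))  # 400
```

With EMMA's four subjects, `per_subject=100` yields the 400-question EMMA-mini split.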
Hardware Specification | No | The paper does not explicitly mention any specific hardware (e.g., GPU models, CPU types, memory amounts) used for running the experiments.
Software Dependencies | Yes | We are using Python version 3.11.0, matplotlib version 3.6.3, and seaborn version 0.12.2 (if applicable).
Experiment Setup | Yes | For all models except o1, QVQ, and Gemini 2.0 Flash Thinking, we test two prompting strategies: (1) Direct prompting, which instructs models to output the answers without reasoning steps; and (2) Chain-of-Thought (CoT) prompting (Wei et al., 2022), where we prompt models to think step-by-step.
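The two prompting strategies above can be sketched as simple prompt templates. The exact wording used in the paper is not reproduced here; the templates and the `build_prompt` helper are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical templates for the two strategies described in the paper.
DIRECT_TEMPLATE = (
    "Answer the following question. Output only the final answer, "
    "without any reasoning steps.\n\n{question}"
)
COT_TEMPLATE = (
    "Answer the following question. Think step by step, then state "
    "your final answer.\n\n{question}"
)

def build_prompt(question: str, strategy: str = "cot") -> str:
    """Format a question under either 'direct' or 'cot' prompting."""
    if strategy == "cot":
        return COT_TEMPLATE.format(question=question)
    if strategy == "direct":
        return DIRECT_TEMPLATE.format(question=question)
    raise ValueError(f"unknown strategy: {strategy}")

print(build_prompt("What is 2 + 2?", strategy="direct"))
```

Under this setup, each model answers every question twice, once per strategy, so the two conditions can be compared on identical inputs.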