Can MLLMs Reason in Multimodality? EMMA: An Enhanced MultiModal ReAsoning Benchmark
Authors: Yunzhuo Hao, Jiawei Gu, Huichen Will Wang, Linjie Li, Zhengyuan Yang, Lijuan Wang, Yu Cheng
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our evaluation of state-of-the-art MLLMs on EMMA reveals significant limitations in handling complex multimodal and multi-step reasoning tasks, with even advanced techniques such as Chain-of-Thought prompting and test-time compute scaling underperforming. |
| Researcher Affiliation | Collaboration | (1) University of Electronic Science and Technology of China, (2) Sun Yat-sen University, (3) University of Washington, (4) Microsoft, (5) The Chinese University of Hong Kong. |
| Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks. The methodology is described in narrative text. |
| Open Source Code | No | The project homepage can be accessed at https://emma-benchmark.github.io/. The homepage might contain code, but the paper makes no explicit statement of code release and provides no direct link to a repository for the methodology itself (benchmark creation and evaluation). |
| Open Datasets | Yes | We introduce EMMA (Enhanced MultiModal reAsoning), a benchmark targeting organic multimodal reasoning across mathematics, physics, chemistry, and coding. ... The project homepage can be accessed at https://emma-benchmark.github.io/. |
| Dataset Splits | Yes | To create a more balanced subset of EMMA, we randomly sample 400 questions (100 per subject) from the benchmark, hereafter referred to as EMMA-mini. |
| Hardware Specification | No | The paper does not explicitly mention any specific hardware (e.g., GPU models, CPU types, memory amounts) used for running the experiments. |
| Software Dependencies | Yes | We are using Python version 3.11.0, matplotlib version 3.6.3, and seaborn version 0.12.2 (if applicable). |
| Experiment Setup | Yes | For all models except o1, QVQ, and Gemini 2.0 Flash Thinking, we test two prompting strategies: (1) Direct prompting, which instructs models to output the answers without reasoning steps; and (2) Chain-of-Thought (CoT) prompting (Wei et al., 2022), where we prompt models to think step-by-step. |
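The EMMA-mini construction quoted under "Dataset Splits" is a stratified random sample: 100 questions per subject for 400 total. A minimal sketch of that sampling step, assuming each question is a dict with a `"subject"` key (the field name and `seed` handling are illustrative assumptions, not taken from the EMMA release):

```python
import random
from collections import defaultdict

def build_mini_subset(questions, per_subject=100, seed=0):
    """Stratified random sample: `per_subject` questions from each subject.

    `questions` is assumed to be a list of dicts carrying a "subject" key;
    the fixed seed makes the subset reproducible across runs.
    """
    rng = random.Random(seed)
    by_subject = defaultdict(list)
    for q in questions:
        by_subject[q["subject"]].append(q)
    subset = []
    # Sort subjects so iteration order (and thus the sample) is deterministic.
    for subject in sorted(by_subject):
        subset.extend(rng.sample(by_subject[subject], per_subject))
    return subset
```

With the paper's four subjects (math, physics, chemistry, coding) and `per_subject=100`, this yields the 400-question EMMA-mini-style subset.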