MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos

Authors: Xuehai He, Weixi Feng, Kaizhi Zheng, Yujie Lu, Wanrong Zhu, Jiachen Li, Yue Fan, Jianfeng Wang, Linjie Li, Zhengyuan Yang, Kevin Lin, William Wang, Lijuan Wang, Xin Wang

ICLR 2025

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | The evaluation covers 4 proprietary and 11 open-source MLLMs, which struggle on MMWorld (e.g., GPT-4o performs best with only 62.5% accuracy), showing large room for improvement. Further ablation studies reveal other interesting findings, such as models exhibiting different skill sets from humans. |
| Researcher Affiliation | Collaboration | Xuehai He¹, Weixi Feng², Kaizhi Zheng¹, Yujie Lu², Wanrong Zhu², Jiachen Li², Yue Fan², Jianfeng Wang³, Linjie Li³, Zhengyuan Yang³, Kevin Lin³, William Yang Wang², Lijuan Wang³, Xin Eric Wang¹ — ¹UCSC, ²UCSB, ³Microsoft. Correspondence: xhe89,EMAIL |
| Pseudocode | No | The paper describes its procedures only in regular paragraph text, without structured formatting resembling pseudocode or algorithm blocks. |
| Open Source Code | No | No explicit statement about releasing open-source code for the methodology or benchmark is found. The only GitHub link refers to a third-party tool (Katna) used in the pipeline, not the authors' own implementation. |
| Open Datasets | Yes | "MMWorld consists of a human-annotated dataset... and a synthetic dataset..." The datasets are available in the supplementary material, and their collection and annotation steps are described in Section 3 of the paper. |
| Dataset Splits | No | The paper introduces MMWorld as a benchmark for multi-discipline, multi-faceted multimodal video understanding. It states: "MMWorld consists of a human-annotated dataset to evaluate MLLMs with questions about the whole videos and a synthetic dataset to analyze MLLMs within a single modality of perception." While it describes the composition of MMWorld, it does not provide explicit training, validation, or test splits for reproducing model training experiments. |
| Hardware Specification | Yes | "All inferences are run on a NVIDIA A6000 workstation." |
| Software Dependencies | No | The paper describes the models used and their default settings but does not list specific versions of programming languages, libraries (e.g., PyTorch, TensorFlow), or other software components used for implementation, beyond mentioning GPT-4-32K as a judge. |
| Experiment Setup | Yes | "For PandaGPT, we set top p to 0.7 and temperature to 0.5. For VideoChat, we set max frames to 100. For X-Instruct-BLIP, the model is implemented using four image frames." |
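The per-model inference settings quoted in the Experiment Setup row can be captured in a small configuration sketch. This is a minimal illustration only: the dictionary layout, key names, and the `get_settings` helper are assumptions for demonstration, not code from the paper or any released repository.

```python
# Hypothetical config sketch of the per-model inference settings reported
# in the paper; the structure and key names are illustrative assumptions.
INFERENCE_SETTINGS = {
    "PandaGPT": {"top_p": 0.7, "temperature": 0.5},
    "VideoChat": {"max_frames": 100},
    "X-Instruct-BLIP": {"num_image_frames": 4},
}

def get_settings(model_name: str, **overrides) -> dict:
    """Return the default settings for a model, with optional overrides."""
    settings = dict(INFERENCE_SETTINGS.get(model_name, {}))
    settings.update(overrides)  # caller-supplied values take precedence
    return settings
```

For example, `get_settings("VideoChat", max_frames=50)` would return the VideoChat defaults with the frame budget overridden to 50.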