MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos

Authors: Xuehai He, Weixi Feng, Kaizhi Zheng, Yujie Lu, Wanrong Zhu, Jiachen Li, Yue Fan, Jianfeng Wang, Linjie Li, Zhengyuan Yang, Kevin Lin, William Wang, Lijuan Wang, Xin Wang

ICLR 2025

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | The evaluation covers 4 proprietary and 11 open-source MLLMs, which struggle on MMWorld (e.g., GPT-4o performs best with only 62.5% accuracy), showing large room for improvement. Further ablation studies reveal other interesting findings, such as models exhibiting different skill sets from humans. |
| Researcher Affiliation | Collaboration | Xuehai He¹, Weixi Feng², Kaizhi Zheng¹, Yujie Lu², Wanrong Zhu², Jiachen Li², Yue Fan², Jianfeng Wang³, Linjie Li³, Zhengyuan Yang³, Kevin Lin³, William Yang Wang², Lijuan Wang³, Xin Eric Wang¹ — ¹UCSC, ²UCSB, ³Microsoft. Correspondence: xhe89,EMAIL |
| Pseudocode | No | The paper describes its procedures only in regular paragraph text, without structured formatting resembling pseudocode or algorithm blocks. |
| Open Source Code | No | No explicit statement about releasing open-source code for the methodology or benchmark is found. The only GitHub link refers to a third-party tool (Katna) used in the pipeline, not the authors' own implementation. |
| Open Datasets | Yes | "MMWorld consists of a human-annotated dataset... and a synthetic dataset..." The datasets are available in the supplementary material, and their collection and annotation steps are described in Section 3 of the paper. |
| Dataset Splits | No | The paper introduces MMWorld as a benchmark for multi-discipline, multi-faceted multimodal video understanding. It states: "MMWorld consists of a human-annotated dataset to evaluate MLLMs with questions about the whole videos and a synthetic dataset to analyze MLLMs within a single modality of perception." While it describes the composition of MMWorld, it does not provide explicit training, validation, or test splits for reproducing model training experiments. |
| Hardware Specification | Yes | "All inferences are run on a NVIDIA A6000 workstation." |
| Software Dependencies | No | The paper describes the models used and their default settings but does not list specific versions of programming languages, libraries (e.g., PyTorch, TensorFlow), or other software components used for implementation, beyond mentioning GPT-4-32K as a judge. |
| Experiment Setup | Yes | "For PandaGPT, we set top p to 0.7 and temperature to 0.5. For VideoChat, we set max frames to 100. For X-Instruct-BLIP, the model is implemented using four image frames." |
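The per-model inference settings quoted in the Experiment Setup row can be captured in a small configuration sketch. This is a minimal illustration only: the dictionary layout, key names, and the `get_settings` helper are assumptions for demonstration, not code from the paper or any released repository.

```python
# Hypothetical config sketch of the per-model inference settings reported
# in the paper; the structure and key names are illustrative assumptions.
INFERENCE_SETTINGS = {
    "PandaGPT": {"top_p": 0.7, "temperature": 0.5},
    "VideoChat": {"max_frames": 100},
    "X-Instruct-BLIP": {"num_image_frames": 4},
}

def get_settings(model_name: str, **overrides) -> dict:
    """Return the default settings for a model, with optional overrides."""
    settings = dict(INFERENCE_SETTINGS.get(model_name, {}))
    settings.update(overrides)  # caller-supplied values take precedence
    return settings
```

For example, `get_settings("VideoChat", max_frames=50)` would return the VideoChat defaults with the frame budget overridden to 50.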