WorldSimBench: Towards Video Generation Models as World Simulators

Authors: Yiran Qin, Zhelun Shi, Jiwen Yu, Xijun Wang, Enshen Zhou, Lijun Li, Zhenfei Yin, Xihui Liu, Lu Sheng, Jing Shao, Lei Bai, Ruimao Zhang

ICML 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Our comprehensive evaluation offers key insights that can drive further innovation in video generation models, positioning World Simulators as a pivotal advancement toward embodied artificial intelligence. (3) We conducted extensive testing across multiple models and performed a thorough analysis of the experimental results.
Researcher Affiliation Academia 1Sun Yat-sen University 2The Chinese University of Hong Kong, Shenzhen 3Shanghai Artificial Intelligence Laboratory 4Beihang University 5The University of Hong Kong 6University of Oxford 7Guangdong Key Laboratory of Big Data Analysis and Processing. Correspondence to: Jing Shao <EMAIL>, Lei Bai <EMAIL>, Ruimao Zhang <EMAIL>.
Pseudocode No The paper describes methods and processes in narrative text and figures, but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code No Project Page: https://iranqin.github.io/WorldSimBench.github.io. The paper provides a project page URL, but it does not explicitly state that the source code for the methodology is available there or provide a direct link to a code repository.
Open Datasets Yes The dataset is constructed based on three key resources, each corresponding to a specific embodied scenario: a curated dataset of Minecraft videos from the internet for OE (Baker et al., 2022), real-world driving data for AD (Caesar et al., 2020), and real-world robot manipulation videos annotated with text instructions for RM (Chen et al., 2024). We employ MineRL as the Minecraft simulator... (Guss et al., 2019), We conduct standard closed-loop evaluations using the CARLA (Dosovitskiy et al., 2017) simulator..., We employ CALVIN (Mees et al., 2022) as the robot manipulation simulator...
Dataset Splits Yes Testing. We evaluated performance in Robot Manipulation using the CALVIN benchmark, where policy models are trained on demonstrations from environments A, B, and C, and evaluated in a zero-shot manner in environment D.
Hardware Specification Yes The training is conducted on 4 A100 80GB GPUs.
Software Dependencies No The paper mentions software components like Flash-VStream, LoRA, and AdamW, but does not provide specific version numbers for these or any other software dependencies.
Experiment Setup Yes We maintain consistent training settings in all three scenarios, with a video sampling frequency of 4. The LoRA settings are aligned with those in Flash-VStream. We use AdamW as the optimizer and employ cosine decay for the learning rate scheduler. We train for 4 epochs with a learning rate of 2e-5 and a warmup ratio of 0.03.
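The reported schedule (cosine decay, learning rate 2e-5, warmup ratio 0.03) can be sketched as a plain per-step learning-rate function. This is a minimal illustration assuming linear warmup followed by standard cosine decay to zero; the paper states only the hyperparameter values, not the exact schedule formula, so the shape here is an assumption.

```python
import math

def lr_at_step(step, total_steps, base_lr=2e-5, warmup_ratio=0.03):
    """Assumed cosine-decay schedule with linear warmup.

    Uses the paper's reported values (base_lr=2e-5, warmup_ratio=0.03);
    the warmup shape and decay-to-zero endpoint are assumptions, not
    details given in the paper.
    """
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        # Linear warmup from 0 up to base_lr.
        return base_lr * (step + 1) / warmup_steps
    # Cosine decay from base_lr down to 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

total = 1000  # hypothetical total optimizer steps
print(lr_at_step(0, total))          # small warmup value
print(lr_at_step(29, total))         # peak lr (2e-5) at end of warmup
print(lr_at_step(total - 1, total))  # near zero at end of training
```

In practice this would typically be wired into the optimizer via a per-step scheduler (e.g. a `LambdaLR` wrapping `torch.optim.AdamW`), but the standalone function makes the schedule itself easy to inspect.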