WorldSimBench: Towards Video Generation Models as World Simulators

Authors: Yiran Qin, Zhelun Shi, Jiwen Yu, Xijun Wang, Enshen Zhou, Lijun Li, Zhenfei Yin, Xihui Liu, Lu Sheng, Jing Shao, Lei Bai, Ruimao Zhang

ICML 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Our comprehensive evaluation offers key insights that can drive further innovation in video generation models, positioning World Simulators as a pivotal advancement toward embodied artificial intelligence. (3) We conducted extensive testing across multiple models and performed a thorough analysis of the experimental results.
Researcher Affiliation Academia 1Sun Yat-sen University 2The Chinese University of Hong Kong, Shenzhen 3Shanghai Artificial Intelligence Laboratory 4Beihang University 5The University of Hong Kong 6University of Oxford 7Guangdong Key Laboratory of Big Data Analysis and Processing. Correspondence to: Jing Shao <EMAIL>, Lei Bai <EMAIL>, Ruimao Zhang <EMAIL>.
Pseudocode No The paper describes methods and processes in narrative text and figures, but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code No Project Page: https://iranqin.github.io/WorldSimBench.github.io. The paper provides a project page URL, but it does not explicitly state that the source code for the methodology is available there or provide a direct link to a code repository.
Open Datasets Yes The dataset is constructed based on three key resources, each corresponding to a specific embodied scenario: a curated dataset of Minecraft videos from the internet for OE (Baker et al., 2022), real-world driving data for AD (Caesar et al., 2020), and real-world robot manipulation videos annotated with text instructions for RM (Chen et al., 2024). We employ MineRL as the Minecraft simulator... (Guss et al., 2019), We conduct standard closed-loop evaluations using the CARLA (Dosovitskiy et al., 2017) simulator..., We employ CALVIN (Mees et al., 2022) as the robot manipulation simulator...
Dataset Splits Yes Testing. We evaluated performance in Robot Manipulation using the CALVIN benchmark, where policy models are trained on demonstrations from environments A, B, and C, and evaluated in a zero-shot manner in environment D.
Hardware Specification Yes The training is conducted on 4 A100 80GB GPUs.
Software Dependencies No The paper mentions software components like Flash-VStream, LoRA, and AdamW, but does not provide specific version numbers for these or any other software dependencies.
Experiment Setup Yes We maintain consistent training settings in all three scenarios, with a video sampling frequency of 4. The LoRA settings are aligned with those in Flash-VStream. We use AdamW as the optimizer and employ cosine decay for the learning rate scheduler. We train for 4 epochs with a learning rate of 2e-5 and a warmup ratio of 0.03.
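The reported schedule (cosine decay, learning rate 2e-5, warmup ratio 0.03) can be sketched as a plain per-step learning-rate function. This is a minimal illustration assuming linear warmup followed by standard cosine decay to zero; the paper states only the hyperparameter values, not the exact schedule formula, so the shape here is an assumption.

```python
import math

def lr_at_step(step, total_steps, base_lr=2e-5, warmup_ratio=0.03):
    """Assumed cosine-decay schedule with linear warmup.

    Uses the paper's reported values (base_lr=2e-5, warmup_ratio=0.03);
    the warmup shape and decay-to-zero endpoint are assumptions, not
    details given in the paper.
    """
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        # Linear warmup from 0 up to base_lr.
        return base_lr * (step + 1) / warmup_steps
    # Cosine decay from base_lr down to 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

total = 1000  # hypothetical total optimizer steps
print(lr_at_step(0, total))          # small warmup value
print(lr_at_step(29, total))         # peak lr (2e-5) at end of warmup
print(lr_at_step(total - 1, total))  # near zero at end of training
```

In practice this would typically be wired into the optimizer via a per-step scheduler (e.g. a `LambdaLR` wrapping `torch.optim.AdamW`), but the standalone function makes the schedule itself easy to inspect.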