Empowering World Models with Reflection for Embodied Video Prediction
Authors: Xiaowei Chi, Chun-Kai Fan, Hengyuan Zhang, Xingqun Qi, Rongyu Zhang, Anthony Chen, Chi-Min Chan, Wei Xue, Qifeng Liu, Shanghang Zhang, Yike Guo
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate the efficacy of EVA in various downstream tasks like video generation and robotics, thereby paving the way for large-scale pre-trained models in real-world video prediction applications. The video demos are available at https://sites.google.com/view/icml-eva. Extensive experiments on EVA-Bench highlight the strong performance of RoG in both in-domain and OOD tasks. Furthermore, to validate the applicability of EVA in robot planning, we evaluate the model using a robot simulator (Mees et al., 2022; Brohan et al., 2022), demonstrating that EVA and RoG effectively support real-world task execution. |
| Researcher Affiliation | Academia | 1Hong Kong University of Science and Technology 2State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University. Correspondence to: Shanghang Zhang <EMAIL>, Yike Guo <EMAIL>. |
| Pseudocode | Yes | Algorithm 1: Reflection of Generation World Model (W_RoG). Input: initial observation O0, task instruction I. Output: sequence of predictions {Ot}, t = 1..T. Initialize the prediction sequence {Ot}. Repeat: H ← Encode(O0, I, {Ot}) (understanding module encodes input states); V̂ ← Reflect(H) (generate reflection output based on H); if V̂ = Extend({Ot}), extend the prediction sequence {Ot}; else if V̂ = Regenerate({Ot}), regenerate the prediction sequence {Ot}; else if V̂ = Output({Ot}), finalize and output the prediction sequence {Ot} and break; until the prediction sequence converges or breaks. |
| Open Source Code | No | The video demos are available at https://sites.google.com/view/icml-eva. |
| Open Datasets | Yes | The complete EVA instruction tuning dataset comprises 500K QA pairs collected from Open-X-Embodiment (Padalkar et al., 2023), Ego4d (Grauman et al., 2022), Ego-Exo4d (Grauman et al., 2024), and CALVIN (Mees et al., 2022). |
| Dataset Splits | No | For the Finish-Think dataset, we use the first 25% of videos to identify unfinished tasks and convert relevant RoboVQA (Sermanet et al., 2024) questions into this format. The complete EVA instruction tuning dataset comprises 500K QA pairs. Detailed information, including prompt structures, dataset ratios, and data examples, is provided in Appendix C and anonymous supplementary pages. EVA-Bench includes 125 meticulously curated high-quality samples from our EVA-Instruct dataset. |
| Hardware Specification | Yes | Training hardware 8 Nvidia A800 chips |
| Software Dependencies | No | Optimizer Adam (β1 = 0.9, β2 = 0.999) |
| Experiment Setup | Yes | Table 7. Hyperparameters for training the EVA diffusion model: Base channels: 320; Optimizer: Adam (β1 = 0.9, β2 = 0.999); Channel multipliers: 1, 2, 4, 4; Learning rate: 0.0001; Blocks per resolution: 2; Batch size: 4; Attention resolutions: 4, 2, 1; Num attention heads: 64; Conditioning embedding dimension: 4096; Conditioning embedding MLP layers: 4; Conditioning token length: 64; Dropout: 0.1; Training hardware: 8 Nvidia A800 chips; Training steps: 20000; Diffusion noise schedule: cosine; Noise schedule log-SNR range: [-20, 20]; Sampling timesteps: 50; Sampling log-variance interpolation: γ = 0.1; Weight decay: 0.0; Prediction target: ϵ |
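The control flow of Algorithm 1 (the W_RoG reflection loop quoted in the Pseudocode row above) can be sketched as follows. This is a minimal illustration, not the paper's implementation: `encode` and `reflect` stand in for the paper's understanding and reflection modules, and the frame-count stopping rule inside `reflect` is a hypothetical placeholder chosen only to make the loop terminate.

```python
def encode(obs0, instruction, predictions):
    """Stand-in for the understanding module: bundle the current state H."""
    return {"obs": obs0, "task": instruction, "preds": list(predictions)}

def reflect(state):
    """Stand-in reflection policy (hypothetical rule: extend until three
    predicted frames exist, then output). The paper's module instead
    judges the generated video against the task instruction."""
    return "extend" if len(state["preds"]) < 3 else "output"

def rog_loop(obs0, instruction, max_iters=10):
    """Reflection-of-generation loop mirroring Algorithm 1's three verdicts."""
    predictions = []
    for _ in range(max_iters):
        state = encode(obs0, instruction, predictions)   # H <- Encode(...)
        verdict = reflect(state)                          # V_hat <- Reflect(H)
        if verdict == "extend":
            predictions.append(f"frame_{len(predictions)}")  # extend {Ot}
        elif verdict == "regenerate":
            predictions = []                                  # regenerate {Ot}
        elif verdict == "output":
            return predictions                                # finalize {Ot}
    return predictions  # fallback if the loop never converges

print(rog_loop("o0", "pick up the cup"))  # → ['frame_0', 'frame_1', 'frame_2']
```

With the placeholder policy above, the loop extends three times and then outputs; swapping in a `regenerate` verdict would discard the sequence and restart, matching the second branch of the algorithm.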