Empowering World Models with Reflection for Embodied Video Prediction
Authors: Xiaowei Chi, Chun-Kai Fan, Hengyuan Zhang, Xingqun Qi, Rongyu Zhang, Anthony Chen, Chi-Min Chan, Wei Xue, Qifeng Liu, Shanghang Zhang, Yike Guo
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate the efficacy of EVA in various downstream tasks like video generation and robotics, thereby paving the way for large-scale pre-trained models in real-world video prediction applications. The video demos are available at https://sites.google.com/view/icml-eva. Extensive experiments on EVA-Bench highlight the strong performance of RoG in both in-domain and OOD tasks. Furthermore, to validate the applicability of EVA in robot planning, we evaluate the model using a robot simulator (Mees et al., 2022; Brohan et al., 2022), demonstrating that EVA and RoG effectively support real-world task execution. |
| Researcher Affiliation | Academia | 1Hong Kong University of Science and Technology 2State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University. Correspondence to: Shanghang Zhang <EMAIL>, Yike Guo <EMAIL>. |
| Pseudocode | Yes | Algorithm 1: Reflection of Generation World Model (W_RoG). Input: initial observation O0, task instruction I. Output: sequence of predictions {Ot}, t = 1..T. Initialize the prediction sequence {Ot}. Repeat: H ← Encode(O0, I, {Ot}) (understanding module encodes input states); V̂ ← Reflect(H) (generate reflection output based on H); if V̂ = Extend({Ot}), extend the prediction sequence {Ot}; else if V̂ = Regenerate({Ot}), regenerate the prediction sequence {Ot}; else if V̂ = Output({Ot}), finalize and output the prediction sequence {Ot} and break; until the prediction sequence converges or breaks. |
| Open Source Code | No | The video demos are available at https://sites.google.com/view/icml-eva. |
| Open Datasets | Yes | The complete EVA instruction tuning dataset comprises 500K QA pairs collected from Open-X-Embodiment (Padalkar et al., 2023), Ego4d (Grauman et al., 2022), Ego-Exo4d (Grauman et al., 2024), and CALVIN (Mees et al., 2022). |
| Dataset Splits | No | For the Finish-Think dataset, we use the first 25% of videos to identify unfinished tasks and convert relevant RoboVQA (Sermanet et al., 2024) questions into this format. The complete EVA instruction tuning dataset comprises 500K QA pairs. Detailed information, including prompt structures, dataset ratios, and data examples, is provided in Appendix C and anonymous supplementary pages. EVA-Bench includes 125 meticulously curated high-quality samples from our EVA-Instruct dataset. |
| Hardware Specification | Yes | Training hardware 8 Nvidia A800 chips |
| Software Dependencies | No | Optimizer Adam (β1 = 0.9, β2 = 0.999) |
| Experiment Setup | Yes | Table 7. Hyperparameters for training the EVA diffusion model: Base channels: 320; Optimizer: Adam (β1 = 0.9, β2 = 0.999); Channel multipliers: 1, 2, 4, 4; Learning rate: 0.0001; Blocks per resolution: 2; Batch size: 4; Attention resolutions: 4, 2, 1; Num attention heads: 64; Conditioning embedding dimension: 4096; Conditioning embedding MLP layers: 4; Conditioning token length: 64; Dropout: 0.1; Training hardware: 8 Nvidia A800 chips; Training steps: 20000; Diffusion noise schedule: cosine; Noise schedule log-SNR range: [-20, 20]; Sampling timesteps: 50; Sampling log-variance interpolation: γ = 0.1; Weight decay: 0.0; Prediction target: ϵ |
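The control flow of Algorithm 1 (the W_RoG reflection loop quoted in the Pseudocode row above) can be sketched as follows. This is a minimal illustration, not the paper's implementation: `encode` and `reflect` stand in for the paper's understanding and reflection modules, and the frame-count stopping rule inside `reflect` is a hypothetical placeholder chosen only to make the loop terminate.

```python
def encode(obs0, instruction, predictions):
    """Stand-in for the understanding module: bundle the current state H."""
    return {"obs": obs0, "task": instruction, "preds": list(predictions)}

def reflect(state):
    """Stand-in reflection policy (hypothetical rule: extend until three
    predicted frames exist, then output). The paper's module instead
    judges the generated video against the task instruction."""
    return "extend" if len(state["preds"]) < 3 else "output"

def rog_loop(obs0, instruction, max_iters=10):
    """Reflection-of-generation loop mirroring Algorithm 1's three verdicts."""
    predictions = []
    for _ in range(max_iters):
        state = encode(obs0, instruction, predictions)   # H <- Encode(...)
        verdict = reflect(state)                          # V_hat <- Reflect(H)
        if verdict == "extend":
            predictions.append(f"frame_{len(predictions)}")  # extend {Ot}
        elif verdict == "regenerate":
            predictions = []                                  # regenerate {Ot}
        elif verdict == "output":
            return predictions                                # finalize {Ot}
    return predictions  # fallback if the loop never converges

print(rog_loop("o0", "pick up the cup"))  # → ['frame_0', 'frame_1', 'frame_2']
```

With the placeholder policy above, the loop extends three times and then outputs; swapping in a `regenerate` verdict would discard the sequence and restart, matching the second branch of the algorithm.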