History-Guided Video Diffusion

Authors: Kiwhan Song, Boyuan Chen, Max Simchowitz, Yilun Du, Russ Tedrake, Vincent Sitzmann

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We empirically demonstrate the state-of-the-art performance and new capabilities enabled by our method, especially in long video generation. Additionally, we provide a theoretical justification of the training objective through a variational lower bound. We empirically evaluate the performance of the Diffusion Forcing Transformer and history guidance. We first validate DFoT as a generic video model without history guidance (Sec. 6.2), demonstrating the effectiveness of the modified training objective. Next, we examine the effectiveness and additional capabilities of history guidance (Secs. 6.3 and 6.4). Finally, we showcase very long videos generated by DFoT with history guidance (Sec. 6.5). Quantitative and qualitative results are discussed extensively in the paper, including FVD and VBench scores.
Researcher Affiliation | Academia | ¹MIT, ²Carnegie Mellon University, ³Harvard University. Correspondence to: Kiwhan Song <EMAIL>, Boyuan Chen <EMAIL>.
Pseudocode | Yes | Algorithm 1: Flexible Sampling with DFoT and (optionally) History Guidance
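The guidance step at the heart of the algorithm above follows the classifier-free-guidance pattern: the model is queried with and without history conditioning, and the two predictions are blended with a guidance weight. A minimal sketch, assuming this standard combination rule (the paper's Algorithm 1 supports richer variants, e.g. guidance across multiple history subsets; the function name is illustrative):

```python
def guided_prediction(pred_cond: float, pred_uncond: float, w: float) -> float:
    """Classifier-free-style history guidance (hedged sketch).

    pred_cond:   model prediction conditioned on history frames
    pred_uncond: model prediction with history dropped
    w:           guidance weight; w = 1 recovers the conditional
                 prediction, w > 1 amplifies the influence of history
    """
    return pred_uncond + w * (pred_cond - pred_uncond)


# Scalar example (in practice these are per-pixel tensors):
# guided_prediction(2.0, 1.0, 1.5) -> 2.5
```

In a real sampler this combination is applied at every denoising step before the DDIM update.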
Open Source Code | No | Project website: https://boyuan.space/history-guidance
Open Datasets | Yes | Kinetics-600 (Kay et al., 2017; 128×128), a standard video prediction benchmark; RealEstate10K or RE10K (Zhou et al., 2018; 256×256), a dataset of real-world indoor scenes with camera pose annotations; and Minecraft (Yan et al., 2023; 256×256), a dataset of long-context Minecraft navigation videos with discrete actions.
Dataset Splits | Yes | On the test split of the dataset, we evaluate the models on a video prediction task, where the model is conditioned on the first 5 history frames and asked to predict the next 11 frames. For the Kinetics-600 rollout experiment, the models generate the next 59 frames using sliding windows, given the first 5 history frames. The sliding windows are applied such that the model is always conditioned on the last 2 latent tokens and generates the next 3 latent tokens.
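The sliding-window rollout described above can be expressed as a simple schedule over latent-token indices. A minimal sketch, assuming each window conditions on the last 2 tokens and appends 3 new ones (function and variable names are illustrative, not from the paper's code):

```python
def rollout_schedule(num_history: int, num_new: int,
                     context: int = 2, chunk: int = 3):
    """Return the sliding windows used to extend a latent-token sequence.

    Each window is a (start, end) pair of indices for the `context`
    tokens the model conditions on; the model then generates `chunk`
    new tokens, and the window slides forward.
    """
    windows = []
    have = num_history                      # tokens produced so far
    while have < num_history + num_new:
        windows.append((have - context, have))
        have += chunk
    return windows


# Starting from 5 history tokens and generating 9 more:
# rollout_schedule(5, 9) -> [(3, 5), (6, 8), (9, 11)]
```

Note that the counts here are in latent tokens: the 59 generated frames correspond to fewer latent tokens after the video tokenizer's temporal compression.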
Hardware Specification | Yes | We utilize 12 H100 GPUs for training all of our video diffusion models, with each model requiring approximately 5 days to train under our chosen batch size. One exception is the Robot model, which is trained on 4 RTX 4090 GPUs for 4 hours.
Software Dependencies | No | The paper mentions software components such as "AdamW", the "DDIM sampler", a "cosine noise schedule", "v-parameterization", "min-SNR loss reweighting", "sigmoid loss reweighting", and "fp16 precision", but does not specify version numbers or the specific libraries used (e.g., PyTorch or Python versions).
Experiment Setup | Yes | We train models for each dataset and for each model class (e.g., DFoT, SD, etc.), using the same pipeline within each dataset. We apply a frame skip, where training video clips are subsampled by a specific stride: a value of 1 for Kinetics-600, 2 for Minecraft, and 1 for Imitation Learning. For RealEstate10K, we use an increasing frame skip, starting from 10 and extending to the maximum frame skip possible within each video, to help the model learn various camera poses. Throughout all training, we employ the AdamW (Loshchilov, 2017) optimizer with linear warmup and a constant learning rate. Additionally, we utilize fp16 precision for computational efficiency and clip gradients to a maximum norm of 1.0 to stabilize training. For robot imitation learning, we follow the setup in Diffusion Forcing (Chen et al., 2024), where we concatenate actions and the next observation together for diffusion, with the exception that we stack the next 15 actions together for every video frame. For all experiments, we use the deterministic DDIM (Song et al., 2020) sampler with 50 steps.
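Two of the stated training details, frame-skip subsampling and the linear-warmup-then-constant learning-rate schedule, can be sketched directly. This is a minimal illustration assuming standard definitions of both (function names and the warmup step count are illustrative; the paper does not give these values):

```python
def subsample(frames: list, skip: int) -> list:
    """Frame skip: keep every `skip`-th frame of a training clip
    (stride 1 for Kinetics-600, 2 for Minecraft per the setup above)."""
    return frames[::skip]


def lr_at_step(step: int, warmup_steps: int, base_lr: float) -> float:
    """Linear warmup to a constant learning rate."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr


# Stride-2 subsampling of a 10-frame clip keeps frames 0, 2, 4, 6, 8;
# after `warmup_steps` optimizer steps, the rate stays at `base_lr`.
```

In a PyTorch pipeline these would pair with `torch.optim.AdamW`, gradient clipping to norm 1.0 via `torch.nn.utils.clip_grad_norm_`, and fp16 autocasting, matching the quoted setup.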