History-Guided Video Diffusion

Authors: Kiwhan Song, Boyuan Chen, Max Simchowitz, Yilun Du, Russ Tedrake, Vincent Sitzmann

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We empirically demonstrate the state-of-the-art performance and new capabilities enabled by our method, especially in long video generation. Additionally, we provide a theoretical justification of the training objective through a variational lower bound. We empirically evaluate the performance of the Diffusion Forcing Transformer and history guidance. We first validate DFoT as a generic video model without history guidance (Sec. 6.2), demonstrating the effectiveness of the modified training objective. Next, we examine the effectiveness and additional capabilities of history guidance (Secs. 6.3 and 6.4). Finally, we showcase very long videos generated by DFoT with history guidance (Sec. 6.5). Quantitative and qualitative results are discussed extensively in the paper, including FVD and VBench scores.
Researcher Affiliation | Academia | ¹MIT, ²Carnegie Mellon University, ³Harvard University. Correspondence to: Kiwhan Song <EMAIL>, Boyuan Chen <EMAIL>.
Pseudocode | Yes | Algorithm 1: Flexible Sampling with DFoT and (optionally) History Guidance
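The guidance step at the heart of the algorithm above follows the classifier-free-guidance pattern: the model is queried with and without history conditioning, and the two predictions are blended with a guidance weight. A minimal sketch, assuming this standard combination rule (the paper's Algorithm 1 supports richer variants, e.g. guidance across multiple history subsets; the function name is illustrative):

```python
def guided_prediction(pred_cond: float, pred_uncond: float, w: float) -> float:
    """Classifier-free-style history guidance (hedged sketch).

    pred_cond:   model prediction conditioned on history frames
    pred_uncond: model prediction with history dropped
    w:           guidance weight; w = 1 recovers the conditional
                 prediction, w > 1 amplifies the influence of history
    """
    return pred_uncond + w * (pred_cond - pred_uncond)


# Scalar example (in practice these are per-pixel tensors):
# guided_prediction(2.0, 1.0, 1.5) -> 2.5
```

In a real sampler this combination is applied at every denoising step before the DDIM update.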
Open Source Code | No | Project website: https://boyuan.space/history-guidance
Open Datasets | Yes | Kinetics-600 (Kay et al., 2017; 128×128), a standard video prediction benchmark; RealEstate10K or RE10K (Zhou et al., 2018; 256×256), a dataset of real-world indoor scenes with camera pose annotations; and Minecraft (Yan et al., 2023; 256×256), a dataset of long-context Minecraft navigation videos with discrete actions.
Dataset Splits | Yes | On the test split of the dataset, we evaluate the models on a video prediction task, where the model is conditioned on the first 5 history frames and asked to predict the next 11 frames. For the Kinetics-600 rollout experiment, the models generate the next 59 frames using sliding windows, given the first 5 history frames. The sliding windows are applied such that the model is always conditioned on the last 2 latent tokens and generates the next 3 latent tokens.
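The sliding-window rollout described above can be expressed as a simple schedule over latent-token indices. A minimal sketch, assuming each window conditions on the last 2 tokens and appends 3 new ones (function and variable names are illustrative, not from the paper's code):

```python
def rollout_schedule(num_history: int, num_new: int,
                     context: int = 2, chunk: int = 3):
    """Return the sliding windows used to extend a latent-token sequence.

    Each window is a (start, end) pair of indices for the `context`
    tokens the model conditions on; the model then generates `chunk`
    new tokens, and the window slides forward.
    """
    windows = []
    have = num_history                      # tokens produced so far
    while have < num_history + num_new:
        windows.append((have - context, have))
        have += chunk
    return windows


# Starting from 5 history tokens and generating 9 more:
# rollout_schedule(5, 9) -> [(3, 5), (6, 8), (9, 11)]
```

Note that the counts here are in latent tokens: the 59 generated frames correspond to fewer latent tokens after the video tokenizer's temporal compression.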
Hardware Specification | Yes | We utilize 12 H100 GPUs for training all of our video diffusion models, with each model requiring approximately 5 days to train under our chosen batch size. One exception is the Robot model, which is trained on 4 RTX 4090 GPUs for 4 hours.
Software Dependencies | No | The paper mentions software components such as "AdamW", the "DDIM sampler", a "cosine noise schedule", "v-parameterization", "min-SNR loss reweighting", "sigmoid loss reweighting", and "fp16 precision", but does not specify version numbers or the specific libraries used (e.g., PyTorch or Python versions).
Experiment Setup | Yes | We train models for each dataset and for each model class (e.g., DFoT, SD, etc.), using the same pipeline within each dataset. We apply a frame skip, where training video clips are subsampled by a specific stride: a value of 1 for Kinetics-600, 2 for Minecraft, and 1 for Imitation Learning. For RealEstate10K, we use an increasing frame skip, starting from 10 and extending to the maximum frame skip possible within each video, to help the model learn various camera poses. Throughout all training, we employ the AdamW (Loshchilov, 2017) optimizer with linear warmup and a constant learning rate. Additionally, we utilize fp16 precision for computational efficiency and clip gradients to a maximum norm of 1.0 to stabilize training. For robot imitation learning, we follow the setup in Diffusion Forcing (Chen et al., 2024), where we concatenate actions and the next observation together for diffusion, with the exception that we stack the next 15 actions together for every video frame. For all experiments, we use the deterministic DDIM (Song et al., 2020) sampler with 50 steps.
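Two of the stated training details, frame-skip subsampling and the linear-warmup-then-constant learning-rate schedule, can be sketched directly. This is a minimal illustration assuming standard definitions of both (function names and the warmup step count are illustrative; the paper does not give these values):

```python
def subsample(frames: list, skip: int) -> list:
    """Frame skip: keep every `skip`-th frame of a training clip
    (stride 1 for Kinetics-600, 2 for Minecraft per the setup above)."""
    return frames[::skip]


def lr_at_step(step: int, warmup_steps: int, base_lr: float) -> float:
    """Linear warmup to a constant learning rate."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr


# Stride-2 subsampling of a 10-frame clip keeps frames 0, 2, 4, 6, 8;
# after `warmup_steps` optimizer steps, the rate stays at `base_lr`.
```

In a PyTorch pipeline these would pair with `torch.optim.AdamW`, gradient clipping to norm 1.0 via `torch.nn.utils.clip_grad_norm_`, and fp16 autocasting, matching the quoted setup.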