Diffusion Models for Video Prediction and Infilling

Authors: Tobias Höppe, Arash Mehrjou, Stefan Bauer, Didrik Nielsen, Andrea Dittadi

TMLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate RaMViD on two benchmark datasets for video prediction, on which we achieve state-of-the-art results, and one for video generation. High-resolution videos are provided at https://sites.google.com/view/video-diffusion-prediction.
Researcher Affiliation | Academia | Tobias Höppe (KTH Stockholm), Arash Mehrjou (MPI for Intelligent Systems & ETH Zürich), Stefan Bauer (KTH Stockholm), Didrik Nielsen (Norwegian Computing Center), Andrea Dittadi (Technical University of Denmark & MPI for Intelligent Systems)
Pseudocode | Yes | The pseudocode for RaMViD is shown in Algorithm 1.
Open Source Code | Yes | Code is available at https://github.com/Tobi-r9/RaMViD.
Open Datasets | Yes | To compare our model to prior work, we train it on the BAIR robot pushing dataset (Ebert et al., 2017). Additionally, we evaluate our model on the Kinetics-600 dataset (Carreira et al., 2018)... To quantitatively evaluate the unconditional generation performance when using p_U > 0, we also train on UCF-101 (Soomro et al., 2012)
Dataset Splits | Yes | For evaluation, we use the same setting as Rakhimov et al. (2020), which is to predict the next 15 frames given one observed frame. We train on videos of length 20. ... On Kinetics-600, we compare our model to concurrent work by predicting 11 frames when conditioned on 5 frames (Luc et al., 2020). We additionally perform several ablation studies on video completion. We train on 16 frames and choose again K = 4. ... For evaluation we predict one sequence for each of the 256 test videos ... For evaluation we take 50,000 videos from the test set ... We train on the entire dataset of 13,320 videos.
Hardware Specification | Yes | This project was enabled by the Berzelius cluster at the Swedish National Supercomputer Center (NSC). ... The models are trained for 250,000 iterations with a batch size of 32 on 8 GPUs. ... For the Kinetics-600 dataset, we increase the batch size to 64 and train for 500,000 iterations on 8 GPUs. ... Each model is trained on 8 NVIDIA A100 GPUs with 40 GB of memory.
Software Dependencies | No | Our implementation relies on the official code of Nichol & Dhariwal (2021), adapted to video data by using 3D convolutions.
Experiment Setup | Yes | We set the learning rate for all our experiments to 2e-5, use a batch size of 32 for BAIR and 64 for Kinetics-600 and UCF-101, and fix T = 1000. We found, especially on the more diverse datasets like Kinetics-600 and UCF-101, that larger batch sizes produce better results. Therefore, to increase the batch size, we use gradient accumulation by computing the gradients for micro-batches of size 2 and accumulate for several steps before doing back-propagation. ... The models are trained for 250,000 iterations with a batch size of 32 on 8 GPUs. ... For the Kinetics-600 dataset, we increase the batch size to 64 and train for 500,000 iterations on 8 GPUs. ... We train RaMViD on UCF-101 with the same setting as used for Kinetics-600 but for 450,000 iterations.
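The gradient-accumulation scheme quoted above can be illustrated with a minimal sketch: gradients are computed on micro-batches (size 2 in the paper) and averaged before a single parameter update, so the update matches what a larger effective batch would produce. The toy linear least-squares model, batch sizes, and `grad` helper below are illustrative stand-ins, not the paper's video diffusion model or training code.

```python
import numpy as np

rng = np.random.default_rng(0)
w = np.zeros(4)                  # toy model parameters
x = rng.normal(size=(8, 4))      # effective batch of 8 samples
y = rng.normal(size=8)
micro = 2                        # micro-batch size, as in the paper

def grad(w, xb, yb):
    # Gradient of the mean squared error 0.5 * mean((xb @ w - yb)^2).
    return xb.T @ (xb @ w - yb) / len(yb)

# Accumulate micro-batch gradients, then perform one update.
acc = np.zeros_like(w)
steps = len(x) // micro
for i in range(steps):
    xb = x[i * micro:(i + 1) * micro]
    yb = y[i * micro:(i + 1) * micro]
    acc += grad(w, xb, yb) / steps   # average over micro-batches
w = w - 2e-5 * acc                   # learning rate from the paper
```

Because all micro-batches have equal size, the averaged accumulated gradient is identical to the gradient over the full effective batch; only peak memory differs, which is why accumulation lets a fixed set of GPUs emulate a larger batch size.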