Diffusion Models for Video Prediction and Infilling
Authors: Tobias Höppe, Arash Mehrjou, Stefan Bauer, Didrik Nielsen, Andrea Dittadi
TMLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate RaMViD on two benchmark datasets for video prediction, on which we achieve state-of-the-art results, and one for video generation. High-resolution videos are provided at https://sites.google.com/view/video-diffusion-prediction. |
| Researcher Affiliation | Academia | Tobias Höppe (KTH Stockholm); Arash Mehrjou (MPI for Intelligent Systems & ETH Zürich); Stefan Bauer (KTH Stockholm); Didrik Nielsen (Norwegian Computing Center); Andrea Dittadi (Technical University of Denmark & MPI for Intelligent Systems) |
| Pseudocode | Yes | The pseudocode for RaMViD is shown in Algorithm 1. |
| Open Source Code | Yes | Code is available at https://github.com/Tobi-r9/RaMViD. |
| Open Datasets | Yes | To compare our model to prior work, we train it on the BAIR robot pushing dataset (Ebert et al., 2017). Additionally, we evaluate our model on the Kinetics-600 dataset (Carreira et al., 2018)... To quantitatively evaluate the unconditional generation performance when using p U > 0, we also train on UCF-101 (Soomro et al., 2012) |
| Dataset Splits | Yes | For evaluation, we use the same setting as Rakhimov et al. (2020), which is to predict the next 15 frames given one observed frame. We train on videos of length 20. ... On Kinetics-600, we compare our model to concurrent work by predicting 11 frames when conditioned on 5 frames (Luc et al., 2020). We additionally perform several ablation studies on video completion. We train on 16 frames and choose again K = 4. ... For evaluation we predict one sequence for each of the 256 test videos ... For evaluation we take 50,000 videos from the test set ... We train on the entire dataset of 13,320 videos. |
| Hardware Specification | Yes | This project was enabled by the Berzelius cluster at the Swedish National Supercomputer Center (NSC). ... The models are trained for 250,000 iterations with a batch size of 32 on 8 GPUs. ... For the Kinetics-600 dataset, we increase the batch size to 64 and train for 500,000 iterations on 8 GPUs. ... Each model is trained on 8 NVIDIA A100 GPUs with 40 GB of memory. |
| Software Dependencies | No | Our implementation relies on the official code of Nichol & Dhariwal (2021), adapted to video data by using 3D convolutions. |
| Experiment Setup | Yes | We set the learning rate for all our experiments to 2e-5, use a batch size of 32 for BAIR and 64 for Kinetics-600 and UCF-101, and fix T = 1000. We found, especially on the more diverse datasets like Kinetics-600 and UCF-101, that larger batch sizes produce better results. Therefore, to increase the batch size, we use gradient accumulation by computing the gradients for micro-batches of size 2 and accumulate for several steps before doing back-propagation. ... The models are trained for 250,000 iterations with a batch size of 32 on 8 GPUs. ... For the Kinetics-600 dataset, we increase the batch size to 64 and train for 500,000 iterations on 8 GPUs. ... We train RaMViD on UCF-101 with the same setting as used for Kinetics-600 but for 450,000 iterations. |
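The experiment-setup cell above mentions gradient accumulation: gradients are computed on micro-batches of size 2 and summed over several steps before a single parameter update, emulating the larger effective batch (e.g. 64). A minimal sketch of that idea, using a toy scalar model rather than the paper's diffusion model (the functions and data here are hypothetical stand-ins):

```python
# Sketch of gradient accumulation as described in the setup: micro-batches
# are processed sequentially, their gradients weighted and summed, and a
# single update is applied -- equivalent to one full-batch step.

def grad_mse(w, xs, ys):
    """Gradient of mean((w*x - y)^2) w.r.t. a scalar weight w."""
    n = len(xs)
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n

def accumulated_step(w, xs, ys, micro_batch=2, lr=2e-5):
    """One optimizer step with gradients accumulated over micro-batches."""
    n = len(xs)
    grad = 0.0
    for i in range(0, n, micro_batch):
        mb_x, mb_y = xs[i:i + micro_batch], ys[i:i + micro_batch]
        # weight each micro-batch gradient by its share of the full batch
        grad += grad_mse(w, mb_x, mb_y) * len(mb_x) / n
    return w - lr * grad

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w0 = 0.5

# The accumulated update matches the full-batch update exactly.
full_step = w0 - 2e-5 * grad_mse(w0, xs, ys)
acc_step = accumulated_step(w0, xs, ys)
assert abs(full_step - acc_step) < 1e-12
```

In a deep-learning framework the same pattern is typically realized by calling the backward pass on each micro-batch (which sums gradients in place) and invoking the optimizer step only every few micro-batches.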