Survey of Video Diffusion Models: Foundations, Implementations, and Applications
Authors: Yimu Wang, Xuye Liu, Wei Pang, Li Ma, Shuai Yuan, Paul Debevec, Ning Yu
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | This survey provides a comprehensive review of diffusion-based video generation, examining its evolution, technical foundations, and practical applications. We present a systematic taxonomy of current methodologies, analyze architectural innovations and optimization strategies, and investigate applications across low-level vision tasks such as denoising and super-resolution. This survey serves as a foundational resource for researchers and practitioners working at the intersection of diffusion models and video generation, providing insights into both the theoretical frameworks and practical implementations that drive this rapidly evolving field. |
| Researcher Affiliation | Collaboration | Yimu Wang, University of Waterloo; Xuye Liu, University of Waterloo; Wei Pang, University of Waterloo; Li Ma, Netflix Eyeline Studios; Shuai Yuan, Duke University; Paul Debevec, Netflix Eyeline Studios; Ning Yu, Netflix Eyeline Studios |
| Pseudocode | Yes | Algorithm 1 Classifier-guided DDPM sampling, given a diffusion model (µθ(xt), Σθ(xt)), classifier pϕ(y|xt), and gradient scale s. Algorithm 2 Classifier-guided DDIM sampling, given a diffusion model ϵθ(xt), classifier pϕ(y|xt), and gradient scale s. Algorithm 3 Joint training a diffusion model with classifier-free guidance Algorithm 4 Conditional sampling with classifier-free guidance |
| Open Source Code | Yes | A structured list of related works involved in this survey is also available on GitHub: https://github.com/EyelineLabs/Survey-Video-Diffusion. |
| Open Datasets | Yes | Table 2: The overview of most popular datasets used in training video generation models. We also include image datasets as they are usually used in training. I, V, T, and A represent image, video, text, and audio. Other commercial datasets include those released by Pond5, Adobe Stock, Shutterstock, Getty, Coverr, Videvo, Depositphotos, Storyblocks, Dissolve, Freepik, Vimeo, and Envato. ... UCF-101 (Soomro et al., 2012) |
| Dataset Splits | No | The paper is a survey and does not present its own experimental results requiring dataset splits. While it mentions various datasets used in other works, it does not provide specific split information for reproducibility of experiments conducted within this paper. |
| Hardware Specification | Yes | Table 1: Comparison of modules and parameters in different diffusion generative models and their industry applications. ... CogVideo (Hong et al., 2023a) ... 8 RTX 6000 ... MagicVideo (Zhou et al., 2022) ... 1 A100 ... Open-Sora (Zheng et al., 2024c) ... 8 H100s |
| Software Dependencies | No | The paper mentions several software components, tools, and models like Flash Attention, ZeRO, Qwen2-VL, CLIP, and GPT-4 Vision. However, it does not provide specific version numbers for these or any other key software dependencies required for replication. |
| Experiment Setup | No | The paper is a survey of video diffusion models and reviews various methodologies and implementations. While it discusses training engineering techniques such as 'multi-resolution frame pack strategy' and 'progressive training strategy', it does not provide specific hyperparameters like learning rates, batch sizes, or optimizer settings for any model's training setup within the main text. |
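The pseudocode row above quotes the survey's Algorithms 3-4, which cover sampling with classifier-free guidance. As a minimal illustration of that guidance step (a generic sketch, not the paper's code; `cfg_noise_estimate` and the random stand-ins for the network outputs are hypothetical names introduced here):

```python
import numpy as np

def cfg_noise_estimate(eps_cond, eps_uncond, guidance_scale):
    """Classifier-free guidance (Ho & Salimans, 2022): blend the
    conditional and unconditional noise predictions of the same
    diffusion network. guidance_scale = 1.0 recovers the purely
    conditional prediction; larger values push samples toward the
    condition at the cost of diversity."""
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

# Toy usage with random arrays standing in for the two network outputs.
rng = np.random.default_rng(0)
eps_c = rng.standard_normal((4, 8))   # eps_theta(x_t, y): conditional branch
eps_u = rng.standard_normal((4, 8))   # eps_theta(x_t, None): unconditional branch
eps = cfg_noise_estimate(eps_c, eps_u, guidance_scale=7.5)
```

In a full sampler this blended estimate `eps` would replace the raw network output inside each DDPM/DDIM denoising step; the guidance scale plays the role of the gradient scale `s` quoted in the pseudocode row.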