Survey of Video Diffusion Models: Foundations, Implementations, and Applications

Authors: Yimu Wang, Xuye Liu, Wei Pang, Li Ma, Shuai Yuan, Paul Debevec, Ning Yu

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Theoretical | This survey provides a comprehensive review of diffusion-based video generation, examining its evolution, technical foundations, and practical applications. We present a systematic taxonomy of current methodologies, analyze architectural innovations and optimization strategies, and investigate applications across low-level vision tasks such as denoising and super-resolution. This survey serves as a foundational resource for researchers and practitioners working at the intersection of diffusion models and video generation, providing insights into both the theoretical frameworks and practical implementations that drive this rapidly evolving field.
Researcher Affiliation | Collaboration | Yimu Wang (University of Waterloo), Xuye Liu (University of Waterloo), Wei Pang (University of Waterloo), Li Ma (Netflix Eyeline Studios), Shuai Yuan (Duke University), Paul Debevec (Netflix Eyeline Studios), Ning Yu (Netflix Eyeline Studios)
Pseudocode | Yes | Algorithm 1: Classifier-guided DDPM sampling, given a diffusion model (µθ(xt), Σθ(xt)), classifier pϕ(y|xt), and gradient scale s. Algorithm 2: Classifier-guided DDIM sampling, given a diffusion model ϵθ(xt), classifier pϕ(y|xt), and gradient scale s. Algorithm 3: Joint training of a diffusion model with classifier-free guidance. Algorithm 4: Conditional sampling with classifier-free guidance.
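The classifier-free guidance sampling referenced above (Algorithms 3–4) hinges on one combination rule: at each denoising step, the conditional noise estimate is extrapolated away from the unconditional one. A minimal sketch of that rule follows; the function name and toy values are illustrative, not from the survey:

```python
import numpy as np

def cfg_noise_estimate(eps_cond, eps_uncond, guidance_scale):
    """Classifier-free guidance combination rule:
        eps = eps_uncond + s * (eps_cond - eps_uncond)
    With s = 1 this reduces to the plain conditional estimate;
    s > 1 pushes samples further toward the condition y."""
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

# Stand-in noise predictions for a single denoising step.
eps_c = np.array([0.5, -0.2])   # model output conditioned on y
eps_u = np.array([0.1,  0.0])   # model output with condition dropped
print(cfg_noise_estimate(eps_c, eps_u, guidance_scale=3.0))
```

The guided estimate then replaces ϵθ(xt) in the usual DDPM/DDIM update; the "joint training" of Algorithm 3 makes a single network serve both terms by randomly dropping the condition during training.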
Open Source Code | Yes | A structured list of related works covered in this survey is also available on GitHub: https://github.com/EyelineLabs/Survey-Video-Diffusion.
Open Datasets | Yes | Table 2: Overview of the most popular datasets used to train video generation models. Image datasets are also included, as they are commonly used in training. I, V, T, and A denote image, video, text, and audio. Other commercial datasets include those released by Pond5, Adobe Stock, Shutterstock, Getty, Coverr, Videvo, Depositphotos, Storyblocks, Dissolve, Freepik, Vimeo, and Envato. ... UCF-101 (Soomro et al., 2012)
Dataset Splits | No | The paper is a survey and does not present its own experimental results requiring dataset splits. While it mentions various datasets used in other works, it does not provide specific split information for reproducibility of experiments conducted within this paper.
Hardware Specification | Yes | Table 1: Comparison of modules and parameters in different diffusion generative models and their industry applications. ... CogVideo (Hong et al., 2023a) ... 8 RTX 6000 ... MagicVideo (Zhou et al., 2022) ... 1 A100 ... Open-Sora (Zheng et al., 2024c) ... 8 H100s
Software Dependencies | No | The paper mentions several software components, tools, and models, such as FlashAttention, ZeRO, Qwen2-VL, CLIP, and GPT-4 Vision. However, it does not provide specific version numbers for these or any other key software dependencies required for replication.
Experiment Setup | No | The paper is a survey of video diffusion models and reviews various methodologies and implementations. While it discusses training engineering techniques such as a "multi-resolution frame pack strategy" and a "progressive training strategy", it does not provide specific hyperparameters such as learning rates, batch sizes, or optimizer settings for any model's training setup within the main text.