Pyramidal Flow Matching for Efficient Video Generative Modeling

Authors: Yang Jin, Zhicheng Sun, Ningyuan Li, Kun Xu, Hao Jiang, Nan Zhuang, Quzhe Huang, Yang Song, Yadong Mu, Zhouchen Lin

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Extensive experiments demonstrate that our method supports generating high-quality 5-second (up to 10-second) videos at 768p resolution and 24 FPS within 20.7k A100 GPU training hours. All code and models are open-sourced at https://pyramid-flow.github.io." (Section 4: Experiments)
Researcher Affiliation | Collaboration | 1 Peking University; 2 Kuaishou Technology; 3 Beijing University of Posts and Telecommunications; 4 State Key Lab of General AI, School of Intelligence Science and Technology, Peking University; 5 Institute for Artificial Intelligence, Peking University; 6 Pazhou Laboratory (Huangpu), Guangzhou, Guangdong, China
Pseudocode | Yes | Algorithm 1: Sampling with Pyramidal Flow Matching
Open Source Code | Yes | "All code and models are open-sourced at https://pyramid-flow.github.io."
Open Datasets | Yes | "Our model is trained on a mixed corpus of open-source image and video datasets. For images, we utilize a high-aesthetic subset of LAION-5B (Schuhmann et al., 2022), 11M from CC-12M (Changpinyo et al., 2021), a 6.9M non-blurred subset of SA-1B (Kirillov et al., 2023), 4.4M from JourneyDB (Sun et al., 2023), and 14M publicly available synthetic data. For video data, we incorporate WebVid-10M (Bain et al., 2021), OpenVid-1M (Nan et al., 2024), and another 1M high-resolution, non-watermarked videos primarily from the Open-Sora Plan (PKU-Yuan Lab et al., 2024)."
Dataset Splits | No | The paper uses well-known benchmark datasets for evaluation (VBench, EvalCrafter) and external datasets for training (LAION-5B, WebVid-10M, etc.), but it does not specify how these are split into training, validation, and test sets for the experiments conducted in this paper, nor does it define custom splits with percentages or sample counts.
Hardware Specification | Yes | "Our model undergoes a three-stage training procedure using 128 NVIDIA A100 GPUs."
Software Dependencies | No | The paper mentions the MM-DiT architecture from SD3 Medium, sinusoidal position encoding, 1D Rotary Position Embedding (RoPE), and the AdamW optimizer, but it does not provide specific version numbers for software libraries, frameworks (e.g., PyTorch, TensorFlow), or programming languages (e.g., Python).
Experiment Setup | Yes | "The detailed training hyper-parameter settings for each optimization stage are reported in Table 4."
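The Pseudocode row refers to the paper's Algorithm 1, sampling with pyramidal flow matching. As a rough illustrative sketch of the general idea only — integrating a flow-matching ODE stage by stage from coarse to fine resolution, with upsampling and renoising at stage transitions — the Python below uses a placeholder `model` interface, made-up stage sizes, step counts, and renoising weights; it is not the paper's exact algorithm.

```python
import torch
import torch.nn.functional as F

def sample_pyramidal_flow(model, stages=(8, 16, 32), steps_per_stage=10):
    """Hypothetical coarse-to-fine flow-matching sampler.

    `model(x, t, stage)` is assumed to return a velocity field with the
    same shape as `x`; all constants here are illustrative, not the
    paper's settings.
    """
    # Start from pure Gaussian noise at the coarsest pyramid level.
    x = torch.randn(1, 4, stages[0], stages[0])
    for k, size in enumerate(stages):
        # Euler integration of the flow ODE within this stage.
        t_grid = torch.linspace(0.0, 1.0, steps_per_stage + 1)
        for i in range(steps_per_stage):
            t = t_grid[i].expand(x.shape[0])
            v = model(x, t, stage=k)  # predicted velocity at time t
            x = x + (t_grid[i + 1] - t_grid[i]) * v
        if k + 1 < len(stages):
            # Transition to the next (finer) stage: upsample the interim
            # latent and re-add noise so the finer stage starts from a
            # partially noised state (illustrative 50/50 mix).
            x = F.interpolate(x, size=stages[k + 1], mode="nearest")
            x = 0.5 * x + 0.5 * torch.randn_like(x)
    return x
```

For a quick smoke test, any callable with the assumed signature works, e.g. `sample_pyramidal_flow(lambda x, t, stage: -x)` returns a tensor at the finest resolution in the `stages` tuple.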