Pyramidal Flow Matching for Efficient Video Generative Modeling
Authors: Yang Jin, Zhicheng Sun, Ningyuan Li, Kun Xu, Hao Jiang, Nan Zhuang, Quzhe Huang, Yang Song, Yadong Mu, Zhouchen Lin
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate that our method supports generating high-quality 5-second (up to 10-second) videos at 768p resolution and 24 FPS within 20.7k A100 GPU training hours. All code and models are open-sourced at https://pyramid-flow.github.io. |
| Researcher Affiliation | Collaboration | 1Peking University, 2Kuaishou Technology, 3Beijing University of Posts and Telecommunications, 4State Key Lab of General AI, School of Intelligence Science and Technology, Peking University, 5Institute for Artificial Intelligence, Peking University, 6Pazhou Laboratory (Huangpu), Guangzhou, Guangdong, China |
| Pseudocode | Yes | Algorithm 1 Sampling with Pyramidal Flow Matching |
| Open Source Code | Yes | All code and models are open-sourced at https://pyramid-flow.github.io. |
| Open Datasets | Yes | Our model is trained on a mixed corpus of open-source image and video datasets. For images, we utilize a high-aesthetic subset of LAION-5B (Schuhmann et al., 2022), 11M from CC-12M (Changpinyo et al., 2021), a 6.9M non-blurred subset of SA-1B (Kirillov et al., 2023), 4.4M from JourneyDB (Sun et al., 2023), and 14M publicly available synthetic data. For video data, we incorporate WebVid-10M (Bain et al., 2021), OpenVid-1M (Nan et al., 2024), and another 1M high-resolution non-watermarked videos primarily from the Open-Sora Plan (PKU-Yuan Lab et al., 2024). |
| Dataset Splits | No | The paper uses well-known benchmarks for evaluation (VBench, EvalCrafter) and external datasets for training (LAION-5B, WebVid-10M, etc.), but does not specify how these are split into training, validation, and test sets for the experiments conducted in this paper, nor does it define custom splits with percentages or sample counts. |
| Hardware Specification | Yes | Our model undergoes a three-stage training procedure using 128 NVIDIA A100 GPUs. |
| Software Dependencies | No | The paper mentions using the 'MM-DiT architecture from SD3 Medium', 'sinusoidal position encoding', '1D Rotary Position Embedding (RoPE)', and the 'AdamW' optimizer, but does not provide specific version numbers for software libraries, frameworks (e.g., PyTorch, TensorFlow), or programming languages (e.g., Python). |
| Experiment Setup | Yes | The detailed training hyper-parameter settings for each optimization stage are reported in Table 4. |
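For context on the "Pseudocode" row: the paper's Algorithm 1 samples across a pyramid of resolutions, which is not reproduced here. As a point of reference, a generic flow-matching sampler (the baseline that pyramidal sampling refines) integrates a learned velocity field with a fixed-step Euler solver. The sketch below is an assumption-laden illustration, not the paper's algorithm; `velocity_model` and the toy velocity field are hypothetical stand-ins.

```python
import numpy as np

def euler_flow_sample(velocity_model, x0, num_steps=50):
    """Integrate dx/dt = v(x, t) from t=0 (noise) to t=1 (data)
    with a fixed-step Euler solver, the standard sampler for
    flow-matching generative models."""
    x = np.array(x0, dtype=float)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        x = x + dt * velocity_model(x, t)  # one Euler step along the flow
    return x

# Toy velocity field: v(x, t) = target - x pulls samples toward `target`.
# (A trained model would predict this field from data instead.)
target = np.ones(4)
result = euler_flow_sample(lambda x, t: target - x, np.zeros(4), num_steps=100)
```

In the paper's pyramidal variant, this trajectory is additionally split into stages at increasing spatial resolutions, with a renoising correction at each resolution jump; only the final stage runs at full resolution, which is where the reported training-compute savings come from.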