Real-Time Video Generation with Pyramid Attention Broadcast
Authors: Xuanlei Zhao, Xiaolong Jin, Kai Wang, Yang You
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present Pyramid Attention Broadcast (PAB), a real-time, high-quality and training-free approach for DiT-based video generation. Our method is founded on the observation that attention difference in the diffusion process exhibits a U-shaped pattern, indicating significant redundancy. We mitigate this by broadcasting attention outputs to subsequent steps in a pyramid style. It applies different broadcast strategies to each attention based on its variance for best efficiency. We further introduce broadcast sequence parallel for more efficient distributed inference. PAB demonstrates up to 10.5× speedup across three models compared to baselines, achieving real-time generation for up to 720p videos. Section 3 is dedicated to 'EXPERIMENTS', where models, metrics, baselines, and implementation details are discussed, and results are presented in tables and figures. |
| Researcher Affiliation | Academia | Xuanlei Zhao (1), Xiaolong Jin (2), Kai Wang (1), Yang You (1); (1) National University of Singapore, (2) Purdue University. Code: NUS-HPC-AI-Lab/VideoSys. All authors are affiliated with universities. |
| Pseudocode | No | The paper describes the proposed method, Pyramid Attention Broadcast (PAB), and its components, including broadcast sequence parallel, through textual descriptions, figures (e.g., Figure 5: Overview of Pyramid Attention Broadcast), and mathematical formulations (Equations 1 and 2), but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code: NUS-HPC-AI-Lab/VideoSys |
| Open Datasets | Yes | We generate videos based on VBench's (Huang et al., 2024) prompts. To further evaluate the efficacy of our method, we expand our analysis using a subset of 1000 videos from WebVid (Bain et al., 2021), a large-scale text-video dataset sourced from stock footage websites. |
| Dataset Splits | Yes | We generate videos based on VBench's (Huang et al., 2024) prompts. To further evaluate the efficacy of our method, we expand our analysis using a subset of 1000 videos from WebVid (Bain et al., 2021), a large-scale text-video dataset sourced from stock footage websites. |
| Hardware Specification | Yes | All experiments are carried out on NVIDIA H100 80GB GPUs with PyTorch. Latency is measured on 8 H100 GPUs. We evaluate the latency and speedup achieved by PAB246/PAB235 (the strategy with the best quality, but less speedup) for single-video generation across up to 8 NVIDIA H100 GPUs. |
| Software Dependencies | No | The paper mentions PyTorch and Flash Attention but does not provide specific version numbers for these software components. It states: 'All experiments are carried out on the NVIDIA H100 80GB GPUs with Pytorch.' and 'We enable Flash Attention (Dao et al., 2022) by default for all experiments.' |
| Experiment Setup | Yes | Table 5 gives the inference config of the three models (model / scheduler / inference steps): Open-Sora: RFLOW, 30; Open-Sora-Plan: PNDM, 150; Latte: DDIM, 50. In Section A.2 (PAB Generation Settings), Table 6 details the attention broadcast configuration, including diffusion timesteps and broadcast ranges (e.g., PAB246 means spatial 2, temporal 4, cross 6, with specific diffusion timesteps), and Table 7 provides the MLP broadcast configuration, including diffusion timesteps, block indices, and broadcast ranges. |
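The core mechanism the review describes — broadcasting (reusing) attention outputs for a fixed number of subsequent diffusion steps, with a different range per attention type — can be sketched as a small cache. This is a minimal illustration under stated assumptions, not the paper's implementation: the `BroadcastCache` class and `attend` method are hypothetical names, and the ranges follow the PAB246 naming quoted above (spatial 2, temporal 4, cross 6).

```python
# Sketch of PAB-style attention broadcast: recompute an attention output
# only when its broadcast range is exhausted; otherwise reuse the cached
# output from an earlier diffusion step. Names here are illustrative.

# Broadcast ranges per attention type, following the PAB246 naming.
BROADCAST_RANGES = {"spatial": 2, "temporal": 4, "cross": 6}

class BroadcastCache:
    def __init__(self, ranges):
        self.ranges = ranges
        self.cache = {}       # attn_type -> last computed output
        self.last_step = {}   # attn_type -> step at which it was computed

    def attend(self, attn_type, step, compute_fn):
        """Return a cached output while within the broadcast range,
        otherwise call compute_fn() (the real attention) and re-cache."""
        rng = self.ranges[attn_type]
        if attn_type in self.cache and step - self.last_step[attn_type] < rng:
            return self.cache[attn_type]   # broadcast the cached output
        out = compute_fn()                 # actual attention computation
        self.cache[attn_type] = out
        self.last_step[attn_type] = step
        return out
```

Over 12 diffusion steps, this schedule recomputes spatial attention every 2 steps (6 calls), temporal every 4 (3 calls), and cross-attention every 6 (2 calls), which is where the claimed redundancy savings come from.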