Sparse VideoGen: Accelerating Video Diffusion Transformers with Spatial-Temporal Sparsity
Authors: Haocheng Xi, Shuo Yang, Yilong Zhao, Chenfeng Xu, Muyang Li, Xiuyu Li, Yujun Lin, Han Cai, Jintao Zhang, Dacheng Li, Jianfei Chen, Ion Stoica, Kurt Keutzer, Song Han
ICML 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We prototype SVG with customized kernel implementations in Triton (Tillet et al., 2019) and FlashInfer (Ye et al., 2025), and evaluate SVG's accuracy and efficiency on representative open video generative models including CogVideoX-v1.5-I2V, CogVideoX-v1.5-T2V, and HunyuanVideo-T2V. SVG delivers significant efficiency improvements, achieving an end-to-end speedup of up to 2.33× while maintaining high visual quality with a PSNR of up to 29, outperforming all prior methods. |
| Researcher Affiliation | Collaboration | Haocheng Xi*, Shuo Yang*, Yilong Zhao, Chenfeng Xu, Muyang Li, Xiuyu Li, Yujun Lin, Han Cai, Jintao Zhang, Dacheng Li, Jianfei Chen, Ion Stoica, Kurt Keutzer, Song Han. *Equal contribution. University of California, Berkeley; NVIDIA; MIT. Correspondence to: Chenfeng Xu <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 Online Profiling Strategy. # Q, K, V, O: [B, H, S, D] query, key, value, output; # S: total token number, e.g., 18k; # t: sampled token number, e.g., 32. # Sample the indices: indices = sample_indices(S, t) # (t,); Q_i = Q[:, :, indices, :]. # Get the attention masks: mask_spatial = gen_spatial_mask()[:, :, indices, :]; mask_temporal = gen_temporal_mask()[:, :, indices, :]. # Compute sampled attention output, shape [B, H, t, D]: O_full = mask_attention(Q_i, K, V, None); O_spatial = mask_attention(Q_i, K, V, mask_spatial); O_temporal = mask_attention(Q_i, K, V, mask_temporal). # Calculate MSE and get best mask, shape [B, H]: MSE_s = (O_full - O_spatial).norm().mean(dim=(2,3)); MSE_t = (O_full - O_temporal).norm().mean(dim=(2,3)); best_mask_config = (MSE_s < MSE_t) |
| Open Source Code | Yes | Our code is open-sourced at https://github.com/svg-project/Sparse-VideoGen. |
| Open Datasets | Yes | For CogVideoX-v1.5, we generate videos using the VBench dataset after prompt optimization, as suggested by CogVideoX (Yang et al., 2024c). For HunyuanVideo, we benchmark our method using the prompts in the Penguin Video Benchmark released by HunyuanVideo (Kong et al., 2024). |
| Dataset Splits | No | The paper evaluates on the VBench and Penguin Video Benchmark datasets. It also states "We skip the first 25% denoising steps for all baselines as they are critical to generation quality". However, it does not provide explicit training, validation, or test splits for these datasets; it uses the benchmarks for evaluation rather than defining splits for model training or validation. |
| Hardware Specification | Yes | For instance, HunyuanVideo requires almost an hour on an NVIDIA A100 GPU to generate only a 5-second video... To demonstrate the feasibility of SVG, we prototype the entire framework with dedicated CUDA kernels based on FlashAttention (Dao et al., 2022), FlashInfer (Ye et al., 2025), and Triton (Tillet et al., 2019). We first showcase the end-to-end speedup of SVG compared to baselines on an H100-80GB-HBM3 with CUDA 12.4. End-to-end speedup benchmark. We incorporate the end-to-end efficiency metrics including FLOPS, latency, and corresponding speedup into Table 1. SVG consistently outperforms all baselines by achieving an average speedup of 2.28× while maintaining the highest generation quality. We further provide a detailed breakdown of end-to-end inference time on HunyuanVideo in Figure 7 to analyze the speedup. Each design point described in Sec 4 contributes significantly to the speedup, with sparse attention delivering the most substantial improvement of 1.81×. Kernel-level efficiency benchmark. We benchmark individual kernel performance including QK-norm, RoPE, and block sparse attention with unit tests in Table 2. Our customized QK-norm and RoPE achieve consistently better throughput across all frame numbers, with average speedups of 7.4× and 15.5×, respectively. For the sparse attention kernel, we compare the latency of our customized kernel with the theoretical speedup across different sparsity levels. As shown in Figure 8, our kernel achieves the theoretical speedup, enabling practical benefit from sparse attention. |
| Software Dependencies | No | To demonstrate the feasibility of SVG, we prototype the entire framework with dedicated CUDA kernels based on FlashAttention (Dao et al., 2022), FlashInfer (Ye et al., 2025), and Triton (Tillet et al., 2019). We first showcase the end-to-end speedup of SVG compared to baselines on an H100-80GB-HBM3 with CUDA 12.4. |
| Experiment Setup | Yes | Parameters. For MInference and PAB, we use their official configurations. For SVG, we choose c_s as 4 frames and c_t as 1224 tokens for CogVideoX-v1.5, while c_s is 10 frames and c_t is 1200 tokens for HunyuanVideo. Such configurations lead to approximately 30% sparsity for both spatial and temporal heads, which is sufficient for lossless generation in general. We skip the first 25% denoising steps for all baselines as they are critical to generation quality, following previous works (Zhao et al., 2024b; Li et al., 2024; Lv et al., 2024; Liu et al., 2024a). To demonstrate the effectiveness of the proposed method, we conduct a sensitivity test on profiling ratio x with CogVideoX-v1.5-I2V. As shown in Table 3, profiling only 1% of tokens can achieve up to 31.1 PSNR, with only 3% runtime overhead compared to full attention. |
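The online profiling strategy quoted in the Pseudocode row can be sketched in plain PyTorch. This is a minimal illustration, not the paper's Triton/FlashInfer implementation: the helper names `masked_attention` and `profile_best_mask`, the squared-error reduction, and the random index sampling are our assumptions filling in the pseudocode's `sample_indices` and `mask_attention` placeholders.

```python
import torch
import torch.nn.functional as F

def masked_attention(q, k, v, mask=None):
    # q: [B, H, t, D]; k, v: [B, H, S, D]; mask: [B, H, t, S] boolean or None.
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    if mask is not None:
        # Disallowed positions get -inf before softmax.
        scores = scores.masked_fill(~mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

def profile_best_mask(Q, K, V, mask_spatial, mask_temporal, t=32):
    # Q, K, V: [B, H, S, D]; masks: [B, H, S, S] boolean.
    B, H, S, D = Q.shape
    idx = torch.randperm(S)[:t]          # sample t query tokens
    Qi = Q[:, :, idx, :]
    O_full = masked_attention(Qi, K, V)
    O_s = masked_attention(Qi, K, V, mask_spatial[:, :, idx, :])
    O_t = masked_attention(Qi, K, V, mask_temporal[:, :, idx, :])
    # Per-head MSE against the dense output, shape [B, H].
    mse_s = (O_full - O_s).pow(2).mean(dim=(2, 3))
    mse_t = (O_full - O_t).pow(2).mean(dim=(2, 3))
    # True -> classify the head as spatial, False -> temporal.
    return mse_s < mse_t
```

Because only `t` of the `S` query tokens are sampled (e.g., 32 of ~18k), the profiling pass is cheap relative to full attention, which matches the reported ~3% overhead at a 1% profiling ratio.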
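The `gen_spatial_mask()` / `gen_temporal_mask()` helpers in the pseudocode are not spelled out in the excerpts. A plausible sketch, under our own assumptions about token layout (tokens ordered frame by frame) and window semantics (`c_s` frames for spatial heads, `c_t` token positions for temporal heads), is:

```python
import torch

def gen_spatial_mask(num_frames, tokens_per_frame, window_frames):
    # Spatial head: each token attends to tokens in nearby frames
    # (frame distance < window_frames). Returns an [S, S] boolean mask.
    S = num_frames * tokens_per_frame
    frame_id = torch.arange(S) // tokens_per_frame
    return (frame_id[:, None] - frame_id[None, :]).abs() < window_frames

def gen_temporal_mask(num_frames, tokens_per_frame, window_tokens):
    # Temporal head: each token attends to nearby spatial positions
    # across all frames (position distance < window_tokens).
    S = num_frames * tokens_per_frame
    pos = torch.arange(S) % tokens_per_frame
    return (pos[:, None] - pos[None, :]).abs() < window_tokens
```

Both masks are static given the video shape, so in a block-sparse kernel they reduce to a fixed set of retained blocks; the actual SVG kernels may use a different layout.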