Sparse VideoGen: Accelerating Video Diffusion Transformers with Spatial-Temporal Sparsity
Authors: Haocheng Xi, Shuo Yang, Yilong Zhao, Chenfeng Xu, Muyang Li, Xiuyu Li, Yujun Lin, Han Cai, Jintao Zhang, Dacheng Li, Jianfei Chen, Ion Stoica, Kurt Keutzer, Song Han
ICML 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We prototype SVG with customized kernel implementations in Triton (Tillet et al., 2019) and FlashInfer (Ye et al., 2025), and evaluate SVG's accuracy and efficiency on representative open video generative models including CogVideoX-v1.5-I2V, CogVideoX-v1.5-T2V, and HunyuanVideo-T2V. SVG delivers significant efficiency improvements, achieving an end-to-end speedup of up to 2.33× while maintaining high visual quality with a PSNR of up to 29, outperforming all prior methods. |
| Researcher Affiliation | Collaboration | Haocheng Xi*, Shuo Yang*, Yilong Zhao, Chenfeng Xu, Muyang Li, Xiuyu Li, Yujun Lin, Han Cai, Jintao Zhang, Dacheng Li, Jianfei Chen, Ion Stoica, Kurt Keutzer, Song Han. *Equal contribution. University of California, Berkeley; NVIDIA; MIT. Correspondence to: Chenfeng Xu <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 Online Profiling Strategy. # Q, K, V, O: [B, H, S, D] query, key, value, output; # S: total token number, e.g., 18k; # t: sampled token number, e.g., 32. # Sample the indices: indices = sample_indices(S, t) # (t,); Q_i = Q[:, :, indices, :]. # Get the attention masks: mask_spatial = gen_spatial_mask()[:, :, indices, :]; mask_temporal = gen_temporal_mask()[:, :, indices, :]. # Compute sampled attention output, shape [B, H, t, D]: O_full = mask_attention(Q_i, K, V, None); O_spatial = mask_attention(Q_i, K, V, mask_spatial); O_temporal = mask_attention(Q_i, K, V, mask_temporal). # Calculate MSE and get best mask, shape [B, H]: MSE_s = (O_full - O_spatial).norm().mean(dim=(2,3)); MSE_t = (O_full - O_temporal).norm().mean(dim=(2,3)); best_mask_config = (MSE_s < MSE_t) |
| Open Source Code | Yes | Our code is open-sourced at https://github.com/svg-project/Sparse-VideoGen. |
| Open Datasets | Yes | For CogVideoX-v1.5, we generate videos using the VBench dataset after prompt optimization, as suggested by CogVideoX (Yang et al., 2024c). For HunyuanVideo, we benchmark our method using the prompts in the Penguin Video Benchmark released by HunyuanVideo (Kong et al., 2024). |
| Dataset Splits | No | The paper evaluates on the VBench and Penguin Video Benchmark datasets. It also states "We skip the first 25% denoising steps for all baselines as they are critical to generation quality". However, it does not provide explicit training, validation, or test splits for these datasets; it uses the benchmarks for evaluation rather than defining splits for model training or validation. |
| Hardware Specification | Yes | For instance, HunyuanVideo requires almost an hour on an NVIDIA A100 GPU to generate only a 5-second video... To demonstrate the feasibility of SVG, we prototype the entire framework with dedicated CUDA kernels based on FlashAttention (Dao et al., 2022), FlashInfer (Ye et al., 2025), and Triton (Tillet et al., 2019). We first showcase the end-to-end speedup of SVG compared to baselines on an H100-80GB-HBM3 with CUDA 12.4. End-to-end speedup benchmark. We incorporate the end-to-end efficiency metrics including FLOPS, latency, and corresponding speedup into Table 1. SVG consistently outperforms all baselines by achieving an average speedup of 2.28× while maintaining the highest generation quality. We further provide a detailed breakdown of end-to-end inference time on HunyuanVideo in Figure 7 to analyze the speedup. Each design point described in Sec 4 contributes significantly to the speedup, with sparse attention delivering the most substantial improvement of 1.81×. Kernel-level efficiency benchmark. We benchmark individual kernel performance including QK-norm, RoPE, and block sparse attention with unit tests in Table 2. Our customized QK-norm and RoPE achieve consistently better throughput across all frame numbers, with average speedups of 7.4× and 15.5×, respectively. For the sparse attention kernel, we compare the latency of our customized kernel with the theoretical speedup across different sparsity levels. As shown in Figure 8, our kernel achieves the theoretical speedup, enabling practical benefit from sparse attention. |
| Software Dependencies | No | To demonstrate the feasibility of SVG, we prototype the entire framework with dedicated CUDA kernels based on FlashAttention (Dao et al., 2022), FlashInfer (Ye et al., 2025), and Triton (Tillet et al., 2019). We first showcase the end-to-end speedup of SVG compared to baselines on an H100-80GB-HBM3 with CUDA 12.4. |
| Experiment Setup | Yes | Parameters. For MInference and PAB, we use their official configurations. For SVG, we choose c_s as 4 frames and c_t as 1224 tokens for CogVideoX-v1.5, while c_s is 10 frames and c_t is 1200 tokens for HunyuanVideo. Such configurations lead to approximately 30% sparsity for both spatial and temporal heads, which is sufficient for lossless generation in general. We skip the first 25% denoising steps for all baselines as they are critical to generation quality, following previous works (Zhao et al., 2024b; Li et al., 2024; Lv et al., 2024; Liu et al., 2024a). To demonstrate the effectiveness of the proposed method, we conduct a sensitivity test on profiling ratio x with CogVideoX-v1.5-I2V. As shown in Table 3, profiling only 1% of tokens can achieve up to 31.1 PSNR, with only 3% runtime overhead compared to full attention. |
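The online profiling strategy quoted in the Pseudocode row can be sketched in plain PyTorch. This is a minimal illustration, not the paper's Triton/FlashInfer implementation: the helper names `masked_attention` and `profile_best_mask`, the squared-error reduction, and the random index sampling are our assumptions filling in the pseudocode's `sample_indices` and `mask_attention` placeholders.

```python
import torch
import torch.nn.functional as F

def masked_attention(q, k, v, mask=None):
    # q: [B, H, t, D]; k, v: [B, H, S, D]; mask: [B, H, t, S] boolean or None.
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    if mask is not None:
        # Disallowed positions get -inf before softmax.
        scores = scores.masked_fill(~mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

def profile_best_mask(Q, K, V, mask_spatial, mask_temporal, t=32):
    # Q, K, V: [B, H, S, D]; masks: [B, H, S, S] boolean.
    B, H, S, D = Q.shape
    idx = torch.randperm(S)[:t]          # sample t query tokens
    Qi = Q[:, :, idx, :]
    O_full = masked_attention(Qi, K, V)
    O_s = masked_attention(Qi, K, V, mask_spatial[:, :, idx, :])
    O_t = masked_attention(Qi, K, V, mask_temporal[:, :, idx, :])
    # Per-head MSE against the dense output, shape [B, H].
    mse_s = (O_full - O_s).pow(2).mean(dim=(2, 3))
    mse_t = (O_full - O_t).pow(2).mean(dim=(2, 3))
    # True -> classify the head as spatial, False -> temporal.
    return mse_s < mse_t
```

Because only `t` of the `S` query tokens are sampled (e.g., 32 of ~18k), the profiling pass is cheap relative to full attention, which matches the reported ~3% overhead at a 1% profiling ratio.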
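The `gen_spatial_mask()` / `gen_temporal_mask()` helpers in the pseudocode are not spelled out in the excerpts. A plausible sketch, under our own assumptions about token layout (tokens ordered frame by frame) and window semantics (`c_s` frames for spatial heads, `c_t` token positions for temporal heads), is:

```python
import torch

def gen_spatial_mask(num_frames, tokens_per_frame, window_frames):
    # Spatial head: each token attends to tokens in nearby frames
    # (frame distance < window_frames). Returns an [S, S] boolean mask.
    S = num_frames * tokens_per_frame
    frame_id = torch.arange(S) // tokens_per_frame
    return (frame_id[:, None] - frame_id[None, :]).abs() < window_frames

def gen_temporal_mask(num_frames, tokens_per_frame, window_tokens):
    # Temporal head: each token attends to nearby spatial positions
    # across all frames (position distance < window_tokens).
    S = num_frames * tokens_per_frame
    pos = torch.arange(S) % tokens_per_frame
    return (pos[:, None] - pos[None, :]).abs() < window_tokens
```

Both masks are static given the video shape, so in a block-sparse kernel they reduce to a fixed set of retained blocks; the actual SVG kernels may use a different layout.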