SpargeAttention: Accurate and Training-free Sparse Attention Accelerating Any Model Inference
Authors: Jintao Zhang, Chendong Xiang, Haofeng Huang, Jia Wei, Haocheng Xi, Jun Zhu, Jianfei Chen
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments show that our method significantly accelerates diverse models, including language, image, and video generation, without sacrificing end-to-end metrics. (Abstract) and 4. Experiment (Section 4 title). |
| Researcher Affiliation | Academia | 1Dept. of Comp. Sci. and Tech., Institute for AI, BNRist Center, THBI Lab, Tsinghua-Bosch Joint ML Center, Tsinghua University 2Institute for Interdisciplinary Information Sciences, Tsinghua University 3EECS, University of California, Berkeley. Correspondence to: Jun Zhu <EMAIL>. |
| Pseudocode | Yes | Algorithm 1: Implementation of SpargeAttn. |
| Open Source Code | Yes | The code is available at https://github.com/thu-ml/SpargeAttn. |
| Open Datasets | Yes | The text-to-text model is evaluated on four zero-shot tasks: WikiText (Merity et al., 2017)... LongBench (Bai et al., 2024)... InfiniteBench (Zhang et al., 2024)... Needle-in-a-Haystack task (Kamradt, 2023)... open-sora (Zheng et al., 2024c) prompt sets. Text-to-image models are assessed on COCO annotations (Lin et al., 2014). |
| Dataset Splits | No | The paper mentions several datasets (WikiText, LongBench, InfiniteBench, Needle-in-a-Haystack, open-sora prompt sets, and COCO annotations) but does not specify training/validation/test splits: no percentages, sample counts, or explicit references to the standard splits of these datasets. |
| Hardware Specification | Yes | Figure 1: SpargeAttn achieves a 1.83x speedup on Mochi on an L40 GPU with no video quality loss (Full Attention end-to-end time: 1897 s; SpargeAttn: 1037 s). Table 2 (end-to-end generation latency, Original / SageAttn / SpargeAttn): CogVideoX on RTX 4090: 87 s / 68 s / 53 s; Mochi on L40: 1897 s / 1544 s / 1037 s; Llama3.1 (24K) on RTX 4090: 4.01 s / 3.53 s / 2.6 s; Llama3.1 (128K) on L40: 52 s / 42 s / 29.98 s. |
| Software Dependencies | No | "We implement our method using CUDA." (Section 4.1) and "SpargeAttn+FA2 means deploying our method on FlashAttention2." (Figure 10 caption). The paper mentions software such as CUDA, FlashAttention2, and SageAttention, but does not specify their version numbers. |
| Experiment Setup | Yes | As discussed in Sec. 3.6, we need to determine l1, l2 for models. We use (l1 = 0.08, l2 = 0.09) for Llama3.1, (l1 = 0.05, l2 = 0.06) for CogVideoX and Mochi, (l1 = 0.07, l2 = 0.08) for Stable-Diffusion3.5 and Flux, and (l1 = 0.03, l2 = 0.035) for Open-Sora-Plan. (Section 4.1) |
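
The per-model (l1, l2) hyperparameters quoted above could be captured as a small lookup for reproduction attempts. This is a hypothetical helper sketch, not code from the paper's repository; the names `SPARGE_THRESHOLDS` and `get_thresholds` are assumptions, while the numeric values come directly from Section 4.1 as quoted in the table.

```python
# Hypothetical lookup of the per-model similarity thresholds (l1, l2)
# reported in Section 4.1 of the paper. Only the values are from the
# source; the structure and names are illustrative.
SPARGE_THRESHOLDS = {
    "Llama3.1": (0.08, 0.09),
    "CogVideoX": (0.05, 0.06),
    "Mochi": (0.05, 0.06),
    "Stable-Diffusion3.5": (0.07, 0.08),
    "Flux": (0.07, 0.08),
    "Open-Sora-Plan": (0.03, 0.035),
}

def get_thresholds(model_name: str) -> tuple[float, float]:
    """Return the (l1, l2) pair for a known model; raise KeyError otherwise."""
    return SPARGE_THRESHOLDS[model_name]
```

A reproduction script would consult this table when configuring the sparse-attention kernel for each model, e.g. `get_thresholds("Mochi")` returns `(0.05, 0.06)`.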