SpargeAttention: Accurate and Training-free Sparse Attention Accelerating Any Model Inference

Authors: Jintao Zhang, Chendong Xiang, Haofeng Huang, Jia Wei, Haocheng Xi, Jun Zhu, Jianfei Chen

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Experiments show that our method significantly accelerates diverse models, including language, image, and video generation, without sacrificing end-to-end metrics." (Abstract); Section 4, "Experiment".
Researcher Affiliation | Academia | "1Dept. of Comp. Sci. and Tech., Institute for AI, BNRist Center, THBI Lab, Tsinghua-Bosch Joint ML Center, Tsinghua University; 2Institute for Interdisciplinary Information Sciences, Tsinghua University; 3EECS, University of California, Berkeley. Correspondence to: Jun Zhu <EMAIL>."
Pseudocode | Yes | Algorithm 1, "Implementation of SpargeAttn".
Open Source Code | Yes | "The code is available at https://github.com/thu-ml/SpargeAttn."
Open Datasets | Yes | "The text-to-text model is evaluated on four zero-shot tasks: WikiText (Merity et al., 2017)... LongBench (Bai et al., 2024)... InfiniteBench (Zhang et al., 2024)... Needle-in-a-Haystack task (Kamradt, 2023)... Open-Sora (Zheng et al., 2024c) prompt sets. Text-to-image models are assessed on COCO annotations (Lin et al., 2014)."
Dataset Splits | No | The paper mentions datasets such as WikiText, LongBench, InfiniteBench, Needle-in-a-Haystack, the Open-Sora prompt sets, and COCO annotations, but does not provide training/validation/test splits (percentages, sample counts, or explicit references to standard splits for these datasets).
Hardware Specification | Yes | "SpargeAttn can achieve 1.83x speedup on Mochi on L40 GPU, with no video quality loss." (Figure 1: full-attention end-to-end time 1897 s vs. SpargeAttn 1037 s on L40.) Table 2 reports end-to-end generation latency:

    Model            GPU      Original  SageAttn  SpargeAttn
    CogVideoX        RTX4090  87 s      68 s      53 s
    Mochi            L40      1897 s    1544 s    1037 s
    Llama3.1 (24K)   RTX4090  4.01 s    3.53 s    2.6 s
    Llama3.1 (128K)  L40      52 s      42 s      29.98 s
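The reported speedup figures follow directly from the end-to-end times quoted above; a quick sanity check of that arithmetic (using only numbers from the paper, with illustrative labels):

```python
# Recompute per-model speedups from the Table 2 latencies quoted above.
times = {
    # model (GPU): (original_seconds, spargeattn_seconds)
    "CogVideoX (RTX4090)": (87.0, 53.0),
    "Mochi (L40)": (1897.0, 1037.0),
    "Llama3.1-24K (RTX4090)": (4.01, 2.6),
    "Llama3.1-128K (L40)": (52.0, 29.98),
}
for model, (orig, sparge) in times.items():
    print(f"{model}: {orig / sparge:.2f}x speedup")
```

For Mochi this gives 1897 / 1037 ≈ 1.83x, matching the Figure 1 claim.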
Software Dependencies | No | "We implement our method using CUDA." (Section 4.1); "SpargeAttn+FA2 means deploying our method on FlashAttention2." (Figure 10 caption). The paper mentions software such as CUDA, FlashAttention2, and SageAttention, but does not specify version numbers.
Experiment Setup | Yes | "As discussed in Sec. 3.6, we need to determine l1, l2 for models. We use (l1 = 0.08, l2 = 0.09) for Llama3.1, (l1 = 0.05, l2 = 0.06) for CogVideoX and Mochi, (l1 = 0.07, l2 = 0.08) for Stable-Diffusion3.5 and Flux, and (l1 = 0.03, l2 = 0.035) for Open-Sora-Plan." (Section 4.1)
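The per-model (l1, l2) thresholds quoted above amount to a small configuration table; a minimal sketch of how they could be captured (the mapping name and structure are hypothetical, following the paper's l1/l2 notation rather than the released code's actual API):

```python
# Per-model (l1, l2) hyperparameters as reported in Section 4.1.
# The dict layout is illustrative; the released code may organize these differently.
SPARGE_THRESHOLDS = {
    "Llama3.1":            {"l1": 0.08, "l2": 0.09},
    "CogVideoX":           {"l1": 0.05, "l2": 0.06},
    "Mochi":               {"l1": 0.05, "l2": 0.06},
    "Stable-Diffusion3.5": {"l1": 0.07, "l2": 0.08},
    "Flux":                {"l1": 0.07, "l2": 0.08},
    "Open-Sora-Plan":      {"l1": 0.03, "l2": 0.035},
}

def thresholds_for(model: str) -> tuple[float, float]:
    """Look up the (l1, l2) pair for a model name."""
    cfg = SPARGE_THRESHOLDS[model]
    return cfg["l1"], cfg["l2"]
```

Having every model's thresholds stated explicitly is what earns the "Yes" here: a reproducer can fill in this table verbatim from Section 4.1.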