SpargeAttention: Accurate and Training-free Sparse Attention Accelerating Any Model Inference

Authors: Jintao Zhang, Chendong Xiang, Haofeng Huang, Jia Wei, Haocheng Xi, Jun Zhu, Jianfei Chen

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Experiments show that our method significantly accelerates diverse models, including language, image, and video generation, without sacrificing end-to-end metrics." (Abstract); Section 4, "Experiment".
Researcher Affiliation | Academia | "1Dept. of Comp. Sci. and Tech., Institute for AI, BNRist Center, THBI Lab, Tsinghua-Bosch Joint ML Center, Tsinghua University; 2Institute for Interdisciplinary Information Sciences, Tsinghua University; 3EECS, University of California, Berkeley. Correspondence to: Jun Zhu <EMAIL>."
Pseudocode | Yes | Algorithm 1, "Implementation of SpargeAttn".
Open Source Code | Yes | "The code is available at https://github.com/thu-ml/SpargeAttn."
Open Datasets | Yes | "The text-to-text model is evaluated on four zero-shot tasks: WikiText (Merity et al., 2017)... LongBench (Bai et al., 2024)... InfiniteBench (Zhang et al., 2024)... Needle-in-a-Haystack task (Kamradt, 2023)... Open-Sora (Zheng et al., 2024c) prompt sets. Text-to-image models are assessed on COCO annotations (Lin et al., 2014)."
Dataset Splits | No | The paper mentions datasets such as WikiText, LongBench, InfiniteBench, Needle-in-a-Haystack, the Open-Sora prompt sets, and COCO annotations, but does not provide training/validation/test splits (percentages, sample counts, or explicit references to standard splits for these datasets).
Hardware Specification | Yes | "SpargeAttn can achieve 1.83x speedup on Mochi on L40 GPU, with no video quality loss." (Figure 1: full-attention end-to-end time 1897 s vs. SpargeAttn 1037 s on L40.) Table 2 reports end-to-end generation latency:

    Model            GPU      Original  SageAttn  SpargeAttn
    CogVideoX        RTX4090  87 s      68 s      53 s
    Mochi            L40      1897 s    1544 s    1037 s
    Llama3.1 (24K)   RTX4090  4.01 s    3.53 s    2.6 s
    Llama3.1 (128K)  L40      52 s      42 s      29.98 s
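The reported speedup figures follow directly from the end-to-end times quoted above; a quick sanity check of that arithmetic (using only numbers from the paper, with illustrative labels):

```python
# Recompute per-model speedups from the Table 2 latencies quoted above.
times = {
    # model (GPU): (original_seconds, spargeattn_seconds)
    "CogVideoX (RTX4090)": (87.0, 53.0),
    "Mochi (L40)": (1897.0, 1037.0),
    "Llama3.1-24K (RTX4090)": (4.01, 2.6),
    "Llama3.1-128K (L40)": (52.0, 29.98),
}
for model, (orig, sparge) in times.items():
    print(f"{model}: {orig / sparge:.2f}x speedup")
```

For Mochi this gives 1897 / 1037 ≈ 1.83x, matching the Figure 1 claim.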
Software Dependencies | No | "We implement our method using CUDA." (Section 4.1); "SpargeAttn+FA2 means deploying our method on FlashAttention2." (Figure 10 caption). The paper mentions software such as CUDA, FlashAttention2, and SageAttention, but does not specify version numbers.
Experiment Setup | Yes | "As discussed in Sec. 3.6, we need to determine l1, l2 for models. We use (l1 = 0.08, l2 = 0.09) for Llama3.1, (l1 = 0.05, l2 = 0.06) for CogVideoX and Mochi, (l1 = 0.07, l2 = 0.08) for Stable-Diffusion3.5 and Flux, and (l1 = 0.03, l2 = 0.035) for Open-Sora-Plan." (Section 4.1)
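The per-model (l1, l2) thresholds quoted above amount to a small configuration table; a minimal sketch of how they could be captured (the mapping name and structure are hypothetical, following the paper's l1/l2 notation rather than the released code's actual API):

```python
# Per-model (l1, l2) hyperparameters as reported in Section 4.1.
# The dict layout is illustrative; the released code may organize these differently.
SPARGE_THRESHOLDS = {
    "Llama3.1":            {"l1": 0.08, "l2": 0.09},
    "CogVideoX":           {"l1": 0.05, "l2": 0.06},
    "Mochi":               {"l1": 0.05, "l2": 0.06},
    "Stable-Diffusion3.5": {"l1": 0.07, "l2": 0.08},
    "Flux":                {"l1": 0.07, "l2": 0.08},
    "Open-Sora-Plan":      {"l1": 0.03, "l2": 0.035},
}

def thresholds_for(model: str) -> tuple[float, float]:
    """Look up the (l1, l2) pair for a model name."""
    cfg = SPARGE_THRESHOLDS[model]
    return cfg["l1"], cfg["l2"]
```

Having every model's thresholds stated explicitly is what earns the "Yes" here: a reproducer can fill in this table verbatim from Section 4.1.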