SageAttention: Accurate 8-Bit Attention for Plug-and-play Inference Acceleration
Authors: Jintao Zhang, Jia Wei, Pengle Zhang, Jun Zhu, Jianfei Chen
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Comprehensive experiments confirm that our approach incurs almost no end-to-end metrics loss across diverse models, including those for language processing, image generation, and video generation. The OPS (operations per second) of our approach outperforms FlashAttention2 and xformers by about 2.1x and 2.7x, respectively. SageAttention also achieves superior accuracy over FlashAttention3. We extensively evaluate the end-to-end metrics of our approach on state-of-the-art image/video generation, image classification, and language models. |
| Researcher Affiliation | Academia | Jintao Zhang, Jia Wei, Pengle Zhang, Jun Zhu, Jianfei Chen Dept. of Comp. Sci. & Tech., Institute for AI, BNRist Center, Tsinghua-Bosch Joint ML Center, THBI Lab, Tsinghua University {zhang-jt24@mails., jianfeic@, dcszj@}tsinghua.edu.cn |
| Pseudocode | Yes | Algorithm 1: Implementation of SAGEAttn-B. |
| Open Source Code | Yes | The code is available at https://github.com/thu-ml/SageAttention. |
| Open Datasets | Yes | Llama2 is evaluated on three zero-shot tasks: WikiText (Merity et al., 2022) to assess the model's prediction confidence, LAMBADA (Paperno et al., 2016) to evaluate contextual understanding, and MMLU (Hendrycks et al., 2020) for measuring knowledge across various subjects. CogvideoX is evaluated using the open-sora (Zheng et al., 2024c) prompt sets. Both UltraPixel and Unidiffuser are assessed on the COCO annotations (Lin et al., 2014), featuring (prompt, image) pairs. TIMM is evaluated on three image datasets: ImageNet (Deng et al., 2009), ImageNet-Sketch (Sketch) (Wang et al., 2019), and ImageNet-Rendition (ImageNet-r) (Hendrycks et al., 2021). Llava1.6 is evaluated on three datasets: TextVQA (Singh et al., 2019), POPE (Li et al., 2023b), and VQAv2 (Goyal et al., 2017). |
| Dataset Splits | No | The paper does not explicitly provide training/test/validation dataset splits for the models it evaluates, which are primarily pre-trained models. It mentions using specific subsets for evaluation (e.g., "first 256 annotations from the COCO 2014 val dataset as the prompt set for UltraPixel and Unidiffuser image generation"), but not general data splits for model reproduction. |
| Hardware Specification | Yes | We implemented our attention kernels using OpenAI Triton (Tillet et al., 2019) and conducted experiments on Ubuntu 22.04 servers. Tests on the RTX 4090 utilized a server with PCIe 5.0, a 16-core Xeon(R) 6430 CPU, and 120GB DDR4 RAM, while the RTX 3090 tests employed a server with a 16-core Xeon(R) 8358P CPU and 80GB DDR4 RAM. |
| Software Dependencies | Yes | To reproduce our results, experiments should be conducted in the environment of torch 2.4.0+cu121, triton-nightly (version of 20240816), python 3.11, and (gcc, g++) in version 9. |
| Experiment Setup | Yes | We use a block size of 128 for Q, and 64 for K and V. The parameters NumWarps and NumStages, which represent the number of warp schedulers and the number of processing stages in our GPU kernels, respectively, are detailed in Table 12: (HeadDim=64, Causal=False) → NumWarps=4, NumStages=3; (HeadDim=64, Causal=True) → NumWarps=4, NumStages=4; (HeadDim=128, Causal=False) → NumWarps=8, NumStages=3; (HeadDim=128, Causal=True) → NumWarps=8, NumStages=5. |
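The method excerpts above (Algorithm 1, SAGEAttn-B, and the accuracy claims) describe 8-bit attention. A minimal NumPy sketch of the core idea follows: smooth K by subtracting its per-channel mean (softmax is invariant to a per-row constant shift of QK^T, so the output is unchanged), then quantize Q and K to INT8 before the first matmul. The function names are illustrative, and the per-tensor quantization granularity is a simplification of the paper's per-block scheme, not the actual kernel.

```python
import numpy as np

def quantize_int8(x):
    """Symmetric INT8 quantization (per tensor here; the real kernel is per block)."""
    scale = np.abs(x).max() / 127.0 + 1e-8
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def sage_attn_sketch(Q, K, V):
    """Illustrative 8-bit attention: smooth K, INT8-quantize Q and K, dequantize QK^T."""
    # K smoothing: subtracting the per-channel mean removes channel-wise
    # outliers that would otherwise dominate the INT8 range. It shifts each
    # row of QK^T by a constant, which softmax ignores.
    K = K - K.mean(axis=0, keepdims=True)
    qQ, sQ = quantize_int8(Q)
    qK, sK = quantize_int8(K)
    # INT8 matmul accumulated in int32, then dequantized by the two scales.
    S = (qQ.astype(np.int32) @ qK.astype(np.int32).T) * (sQ * sK)
    S = S / np.sqrt(Q.shape[-1])
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    P = P / P.sum(axis=-1, keepdims=True)
    # The paper's kernel keeps the P·V matmul in FP16 accumulation;
    # this sketch simply uses full precision.
    return P @ V
```

On random inputs, the output stays close to full-precision attention, which is the point of the accuracy claims quoted in the table.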
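The Table 12 parameters quoted in the Experiment Setup row can be expressed as a small lookup from kernel shape to launch configuration. The names `kernel_config`, `head_dim`, and `causal` are hypothetical, chosen for this sketch rather than taken from the released Triton code; only the four (NumWarps, NumStages) pairs come from the paper.

```python
# (head_dim, causal) -> (num_warps, num_stages), per Table 12 of the paper.
KERNEL_PARAMS = {
    (64, False): (4, 3),
    (64, True): (4, 4),
    (128, False): (8, 3),
    (128, True): (8, 5),
}

def kernel_config(head_dim: int, causal: bool) -> tuple:
    """Return (num_warps, num_stages) for a given head dim and mask setting."""
    return KERNEL_PARAMS[(head_dim, causal)]
```

Triton kernels typically take `num_warps` and `num_stages` as launch-time arguments, so a dispatch table like this is a common way to specialize per shape.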