SageAttention: Accurate 8-Bit Attention for Plug-and-play Inference Acceleration

Authors: Jintao Zhang, Jia Wei, Pengle Zhang, Jun Zhu, Jianfei Chen

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Comprehensive experiments confirm that our approach incurs almost no end-to-end metrics loss across diverse models including those for large language processing, image generation, and video generation. The OPS (operations per second) of our approach outperforms FlashAttention2 and xformers by about 2.1x and 2.7x, respectively. SageAttention also achieves superior accuracy performance over FlashAttention3. We extensively evaluate the end-to-end metrics of our approach on state-of-the-art image/video generation, image classification, and language models.
Researcher Affiliation | Academia | Jintao Zhang, Jia Wei, Pengle Zhang, Jun Zhu, Jianfei Chen; Dept. of Comp. Sci. & Tech., Institute for AI, BNRist Center, Tsinghua-Bosch Joint ML Center, THBI Lab, Tsinghua University; {zhang-jt24@mails., jianfeic@, dcszj@}tsinghua.edu.cn
Pseudocode | Yes | Algorithm 1: Implementation of SAGEAttn-B.
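To give a concrete feel for what 8-bit attention quantization involves, here is a minimal NumPy sketch of a symmetric per-block INT8 quantize/dequantize round trip. The function names, the block size, and the plain symmetric quantizer are illustrative assumptions only; the paper's actual SAGEAttn-B (Algorithm 1) runs as fused GPU kernels with additional accuracy-preserving steps not shown here.

```python
import numpy as np

def quant_int8_per_block(x: np.ndarray, block: int = 64):
    """Symmetric per-block INT8 quantization along the first (sequence) axis.

    Illustrative sketch only, not the paper's kernel. Each block of rows
    gets its own scale, so one outlier block does not degrade the others.
    """
    n = x.shape[0]
    q = np.empty(x.shape, dtype=np.int8)
    scales = []
    for start in range(0, n, block):
        blk = x[start:start + block]
        scale = np.abs(blk).max() / 127.0 + 1e-12  # per-block scale factor
        q[start:start + block] = np.clip(
            np.round(blk / scale), -127, 127
        ).astype(np.int8)
        scales.append(scale)
    return q, np.array(scales, dtype=np.float32)

def dequant_int8_per_block(q: np.ndarray, scales: np.ndarray, block: int = 64):
    """Invert the quantizer above: rescale each block back to float32."""
    x = q.astype(np.float32)
    for i, s in enumerate(scales):
        x[i * block:(i + 1) * block] *= s
    return x
```

Because each element is rounded to the nearest of 255 levels inside its block, the round-trip error is bounded by half the per-block scale, which is the basic reason per-block INT8 can stay close to full precision.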
Open Source Code | Yes | The code is available at https://github.com/thu-ml/SageAttention.
Open Datasets | Yes | Llama2 is evaluated on three zero-shot tasks: WikiText (Merity et al., 2022) to assess the model's prediction confidence, LAMBADA (Paperno et al., 2016) to evaluate contextual understanding, and MMLU (Hendrycks et al., 2020) for measuring knowledge across various subjects. CogVideoX is evaluated using the Open-Sora (Zheng et al., 2024c) prompt sets. Both UltraPixel and UniDiffuser are assessed on the COCO annotations (Lin et al., 2014), featuring (prompt, image) pairs. TIMM is evaluated on three image datasets: ImageNet (Deng et al., 2009), ImageNet-Sketch (Sketch) (Wang et al., 2019), and ImageNet-Rendition (ImageNet-R) (Hendrycks et al., 2021). LLaVA-1.6 is evaluated on three datasets: TextVQA (Singh et al., 2019), POPE (Li et al., 2023b), and VQAv2 (Goyal et al., 2017).
Dataset Splits | No | The paper does not explicitly provide training/validation/test dataset splits for the models it evaluates, which are primarily pre-trained models. It mentions using specific subsets for evaluation (e.g., "first 256 annotations from the COCO 2014 val dataset as the prompt set for UltraPixel and UniDiffuser image generation"), but not general data splits for model reproduction.
Hardware Specification | Yes | We implemented our attention kernels using OpenAI Triton (Tillet et al., 2019) and conducted experiments on Ubuntu 22.04 servers. Tests on the RTX 4090 utilized a server with PCIe 5.0, a 16-core Xeon(R) 6430 CPU, and 120 GB DDR4 RAM, while the RTX 3090 tests employed a server with a 16-core Xeon(R) 8358P CPU and 80 GB DDR4 RAM.
Software Dependencies | Yes | To reproduce our results, experiments should be conducted in an environment with torch 2.4.0+cu121, triton-nightly (version of 2024-08-16), Python 3.11, and gcc/g++ version 9.
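A reproducer could sanity-check their environment against the reported versions before running the kernels. The sketch below is a minimal, assumed helper (the dictionary of expected versions and the function name are not from the paper; triton-nightly and the gcc/g++ toolchain would need to be checked separately).

```python
import sys

# Versions reported in the paper; triton-nightly (2024-08-16) and
# gcc/g++ 9 are not checked here.
EXPECTED = {
    "python": (3, 11),
    "torch": "2.4.0+cu121",
}

def environment_report() -> dict:
    """Return {component: matches_reported_version} for a quick sanity check."""
    report = {"python": sys.version_info[:2] == EXPECTED["python"]}
    try:
        import torch  # torch may be absent; then it simply reports False
        report["torch"] = torch.__version__ == EXPECTED["torch"]
    except ImportError:
        report["torch"] = False
    return report
```

Pinning the exact `+cu121` build matters here because the Triton kernels are compiled against a specific CUDA toolkit.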
Experiment Setup | Yes | We use a block size of 128 for Q and a block size of 64 for K and V. The parameters Num Warps and Num Stages, which represent the number of warp schedulers and the number of processing stages in our GPU kernels, respectively, are detailed in Table 12.

Head Dim | Causal Mask | Num Warps | Num Stages
64 | False | 4 | 3
64 | True | 4 | 4
128 | False | 8 | 3
128 | True | 8 | 5
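The reported launch parameters can be captured as a small lookup table, which is how a reproducer might wire them into a Triton kernel launch. The `BLOCK_M`/`BLOCK_N` key names below are assumed Triton-style conventions, not names taken from the paper; only the numeric values come from the text and Table 12.

```python
# Kernel launch parameters from Table 12,
# keyed by (head_dim, causal_mask) -> (num_warps, num_stages).
KERNEL_PARAMS = {
    (64, False): (4, 3),
    (64, True): (4, 4),
    (128, False): (8, 3),
    (128, True): (8, 5),
}

# Block sizes reported in the text: 128 for Q, 64 for K and V.
BLOCK_Q = 128
BLOCK_KV = 64

def launch_config(head_dim: int, causal: bool) -> dict:
    """Return the reported launch configuration for a given setting.

    BLOCK_M / BLOCK_N are assumed parameter names for the Q and K/V
    tile sizes, following common Triton attention-kernel conventions.
    """
    num_warps, num_stages = KERNEL_PARAMS[(head_dim, causal)]
    return {
        "BLOCK_M": BLOCK_Q,
        "BLOCK_N": BLOCK_KV,
        "num_warps": num_warps,
        "num_stages": num_stages,
    }
```

For example, a causal kernel at head dimension 128 would launch with 8 warps and 5 pipeline stages, the most aggressively pipelined of the four settings.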