SageAttention: Accurate 8-Bit Attention for Plug-and-play Inference Acceleration
Authors: Jintao Zhang, Jia Wei, Pengle Zhang, Jun Zhu, Jianfei Chen
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Comprehensive experiments confirm that our approach incurs almost no end-to-end metrics loss across diverse models, including those for language processing, image generation, and video generation. The OPS (operations per second) of our approach outperforms FlashAttention2 and xformers by about 2.1x and 2.7x, respectively. SageAttention also achieves superior accuracy over FlashAttention3. We extensively evaluate the end-to-end metrics of our approach on state-of-the-art image/video generation, image classification, and language models. |
| Researcher Affiliation | Academia | Jintao Zhang, Jia Wei, Pengle Zhang, Jun Zhu, Jianfei Chen Dept. of Comp. Sci. & Tech., Institute for AI, BNRist Center, Tsinghua-Bosch Joint ML Center, THBI Lab, Tsinghua University {zhang-jt24@mails., jianfeic@, dcszj@}tsinghua.edu.cn |
| Pseudocode | Yes | Algorithm 1: Implementation of SAGEAttn-B. |
| Open Source Code | Yes | The code is available at https://github.com/thu-ml/SageAttention. |
| Open Datasets | Yes | Llama2 is evaluated on three zero-shot tasks: WikiText (Merity et al., 2022) to assess the model's prediction confidence, LAMBADA (Paperno et al., 2016) to evaluate contextual understanding, and MMLU (Hendrycks et al., 2020) for measuring knowledge across various subjects. CogvideoX is evaluated using the open-sora (Zheng et al., 2024c) prompt sets. Both UltraPixel and Unidiffuser are assessed on the COCO annotations (Lin et al., 2014), featuring (prompt, image) pairs. TIMM is evaluated on three image datasets: ImageNet (Deng et al., 2009), ImageNet-Sketch (Sketch) (Wang et al., 2019), and ImageNet-Rendition (ImageNet-r) (Hendrycks et al., 2021). Llava1.6 is evaluated on three datasets: TextVQA (Singh et al., 2019), POPE (Li et al., 2023b), and VQAv2 (Goyal et al., 2017). |
| Dataset Splits | No | The paper does not explicitly provide training/test/validation dataset splits for the models it evaluates, which are primarily pre-trained models. It mentions using specific subsets for evaluation (e.g., "first 256 annotations from the COCO 2014 val dataset as the prompt set for UltraPixel and Unidiffuser image generation"), but not general data splits for model reproduction. |
| Hardware Specification | Yes | We implemented our attention kernels using OpenAI Triton (Tillet et al., 2019) and conducted experiments on Ubuntu 22.04 servers. Tests on the RTX 4090 utilized a server with PCIe 5.0, a 16-core Xeon(R) 6430 CPU, and 120GB DDR4 RAM, while the RTX 3090 tests employed a server with a 16-core Xeon(R) 8358P CPU and 80GB DDR4 RAM. |
| Software Dependencies | Yes | To reproduce our results, experiments should be conducted in the environment of torch 2.4.0+cu121, triton-nightly (version of 20240816), python 3.11, and (gcc, g++) in version 9. |
| Experiment Setup | Yes | We use a block size of 128 for Q, and 64 for K and V. The parameters NumWarps and NumStages, which represent the number of warp schedulers and the number of processing stages in our GPU kernels, respectively, are detailed in Table 12: (HeadDim=64, Causal=False) → NumWarps=4, NumStages=3; (HeadDim=64, Causal=True) → NumWarps=4, NumStages=4; (HeadDim=128, Causal=False) → NumWarps=8, NumStages=3; (HeadDim=128, Causal=True) → NumWarps=8, NumStages=5. |
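The method excerpts above (Algorithm 1, SAGEAttn-B, and the accuracy claims) describe 8-bit attention. A minimal NumPy sketch of the core idea follows: smooth K by subtracting its per-channel mean (softmax is invariant to a per-row constant shift of QK^T, so the output is unchanged), then quantize Q and K to INT8 before the first matmul. The function names are illustrative, and the per-tensor quantization granularity is a simplification of the paper's per-block scheme, not the actual kernel.

```python
import numpy as np

def quantize_int8(x):
    """Symmetric INT8 quantization (per tensor here; the real kernel is per block)."""
    scale = np.abs(x).max() / 127.0 + 1e-8
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def sage_attn_sketch(Q, K, V):
    """Illustrative 8-bit attention: smooth K, INT8-quantize Q and K, dequantize QK^T."""
    # K smoothing: subtracting the per-channel mean removes channel-wise
    # outliers that would otherwise dominate the INT8 range. It shifts each
    # row of QK^T by a constant, which softmax ignores.
    K = K - K.mean(axis=0, keepdims=True)
    qQ, sQ = quantize_int8(Q)
    qK, sK = quantize_int8(K)
    # INT8 matmul accumulated in int32, then dequantized by the two scales.
    S = (qQ.astype(np.int32) @ qK.astype(np.int32).T) * (sQ * sK)
    S = S / np.sqrt(Q.shape[-1])
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    P = P / P.sum(axis=-1, keepdims=True)
    # The paper's kernel keeps the P·V matmul in FP16 accumulation;
    # this sketch simply uses full precision.
    return P @ V
```

On random inputs, the output stays close to full-precision attention, which is the point of the accuracy claims quoted in the table.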
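The Table 12 parameters quoted in the Experiment Setup row can be expressed as a small lookup from kernel shape to launch configuration. The names `kernel_config`, `head_dim`, and `causal` are hypothetical, chosen for this sketch rather than taken from the released Triton code; only the four (NumWarps, NumStages) pairs come from the paper.

```python
# (head_dim, causal) -> (num_warps, num_stages), per Table 12 of the paper.
KERNEL_PARAMS = {
    (64, False): (4, 3),
    (64, True): (4, 4),
    (128, False): (8, 3),
    (128, True): (8, 5),
}

def kernel_config(head_dim: int, causal: bool) -> tuple:
    """Return (num_warps, num_stages) for a given head dim and mask setting."""
    return KERNEL_PARAMS[(head_dim, causal)]
```

Triton kernels typically take `num_warps` and `num_stages` as launch-time arguments, so a dispatch table like this is a common way to specialize per shape.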