HashAttention: Semantic Sparsity for Faster Inference

Authors: Aditya Desai, Shuo Yang, Alejandro Cuadron, Matei Zaharia, Joseph E. Gonzalez, Ion Stoica

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | On an A100 GPU, at 32× sparsity, incorporating HashAttention reduces attention latency by up to 4.3× in GPT-FAST and 2.54× in FlashDecode, and achieves up to 3.12× higher throughput for GPT-FAST. We present the following experiments to evaluate the quality of HashAttention. Table 1. A head-to-head comparison of sparse-attention baselines on datasets from LongBench from all the categories using Llama-3.1-8B-Instruct.
Researcher Affiliation | Academia | ¹Department of Electrical Engineering and Computer Sciences, UC Berkeley; ²Department of Computer Sciences, ETH Zurich. Correspondence to: Aditya Desai <EMAIL>.
Pseudocode | No | The paper describes the 'SCORE function' and other routines as a 'general recipe of sparse attention' but does not present them in a structured pseudocode or algorithm block.
Open Source Code | Yes | The code for HashAttention is available at https://github.com/xAlg-ai/HashAttention-1.0
Open Datasets | Yes | We use Llama-3.1-8B-Instruct (Dubey et al., 2024) and Mistral-7B-Instruct-v0.3 (Jiang et al., 2023) models and LongBench (Bai et al., 2024b) and RULER (Hsieh et al., 2024) datasets in our evaluation. All attention heads are replaced with sparse attention. HashAttention is trained using the OpenWebText (Gokaslan et al., 2019) dataset, with samples concatenated together to obtain the required context length.
Dataset Splits | Yes | We randomly select one dataset from each category in LongBench and use the first 175 samples (the last 25 samples are added to the training set). [...] Evaluation is done on the first 175 examples (MFQA was excluded from training since it has only 150 samples).
Hardware Specification | Yes | On an A100 GPU, at 32× sparsity, incorporating HashAttention reduces attention latency by up to 4.3× in GPT-FAST and 2.54× in FlashDecode, and achieves up to 3.12× higher throughput for GPT-FAST.
Software Dependencies | No | The paper mentions GPT-FAST (GPTFast, 2024) and FlashDecode (Dao, 2023) as efficiency frameworks, but does not provide specific version numbers for these or any other software libraries or dependencies used in the experiments.
Experiment Setup | Yes | In these experiments, we use a 3-layer MLP (128x128-128x128-128x32) for the mappings learned in HashAttention. [...] We use binary cross-entropy loss in a multi-class setting to train our functions ϕ_kv and ϕ_q with a standard Adam optimizer. [...] We use the following formula to compute the class-1 weights, parameterized by α and β: class1-weight = α + β · context-length (4). α and β are hyperparameters that can be chosen.
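Although the paper gives no pseudocode, the 'general recipe of sparse attention' it describes is: a cheap SCORE function estimates each key's relevance to the query, and exact attention is then computed only over the top-k keys. A minimal NumPy sketch of that recipe, with illustrative names (HashAttention learns its SCORE function via hashing; here it is an arbitrary callable):

```python
import numpy as np

def topk_sparse_attention(q, K, V, score_fn, k):
    """One decode step of top-k sparse attention.

    score_fn cheaply estimates each key's relevance to the query q;
    exact softmax attention is then computed only over the k
    highest-scoring keys. This is only an illustrative sketch of the
    general recipe, not the paper's implementation.
    """
    scores = score_fn(q, K)               # relevance estimates, shape (n,)
    idx = np.argsort(-scores)[:k]         # indices of the top-k keys
    Ks, Vs = K[idx], V[idx]               # gather the sparse subset
    logits = Ks @ q / np.sqrt(q.shape[-1])
    w = np.exp(logits - logits.max())     # numerically stable softmax
    w /= w.sum()
    return w @ Vs                         # attention output over the subset
```

With `score_fn = lambda q, K: K @ q` and k equal to the full context length this reduces to exact attention; shrinking k trades accuracy for the latency savings reported in the quotes above.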
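The split protocol quoted under Dataset Splits (first 175 samples for evaluation, last 25 added to training) can be sketched as follows; the function and argument names are ours, not the paper's:

```python
def split_longbench_dataset(samples, n_eval=175, n_train_tail=25):
    """Illustrative split matching the quoted protocol: the first
    n_eval samples are held out for evaluation, and the last
    n_train_tail samples are added to the training set."""
    eval_set = samples[:n_eval]
    train_extra = samples[-n_train_tail:]
    return eval_set, train_extra
```

Note the quoted exception: MFQA, with only 150 samples, contributes no training tail and is evaluated in full.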
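The class-weight formula in Eq. (4) and the weighted binary cross-entropy objective quoted under Experiment Setup can be sketched in NumPy. The α and β defaults below are illustrative placeholders, not values from the paper:

```python
import numpy as np

def class1_weight(context_length, alpha, beta):
    # Eq. (4): class1-weight = alpha + beta * context-length
    # alpha and beta are hyperparameters chosen by the user.
    return alpha + beta * context_length

def weighted_bce(logits, targets, context_length, alpha=1.0, beta=0.01):
    """Binary cross-entropy with the positive (class-1) term up-weighted
    per Eq. (4). Longer contexts have proportionally fewer relevant
    tokens, so the positive-class weight grows with context length.
    Default alpha/beta are illustrative, not the paper's."""
    p = 1.0 / (1.0 + np.exp(-logits))           # sigmoid probabilities
    w1 = class1_weight(context_length, alpha, beta)
    loss = -(w1 * targets * np.log(p) + (1 - targets) * np.log(1 - p))
    return loss.mean()
```

In the quoted setup this loss would train the two 3-layer MLP mappings (128x128-128x128-128x32) with Adam; the sketch above covers only the objective.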