HashAttention: Semantic Sparsity for Faster Inference

Authors: Aditya Desai, Shuo Yang, Alejandro Cuadron, Matei Zaharia, Joseph E. Gonzalez, Ion Stoica

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | On an A100 GPU, at 32× sparsity, incorporating HashAttention reduces attention latency by up to 4.3× in GPT-FAST and 2.54× in FlashDecode, and achieves up to 3.12× higher throughput for GPT-FAST. We present the following experiments to evaluate the quality of HashAttention. Table 1. A head-to-head comparison of sparse-attention baselines on datasets from LongBench from all the categories using Llama-3.1-8B-Instruct.
Researcher Affiliation | Academia | ¹Department of Electrical Engineering and Computer Sciences, UC Berkeley; ²Department of Computer Sciences, ETH Zurich. Correspondence to: Aditya Desai <EMAIL>.
Pseudocode | No | The paper describes the 'SCORE function' and other routines as a 'general recipe of sparse attention' but does not present them in a structured pseudocode or algorithm block.
Open Source Code | Yes | The code for HashAttention is available at https://github.com/xAlg-ai/HashAttention-1.0
Open Datasets | Yes | We use Llama-3.1-8B-Instruct (Dubey et al., 2024) and Mistral-7B-Instruct-v0.3 (Jiang et al., 2023) models and LongBench (Bai et al., 2024b) and RULER (Hsieh et al., 2024) datasets in our evaluation. All attention heads are replaced with sparse attention. HashAttention is trained using the OpenWebText (Gokaslan et al., 2019) dataset, with samples concatenated together to obtain the required context length.
Dataset Splits | Yes | We randomly select one dataset from each category in LongBench and use the first 175 samples (the last 25 samples are added to the training set). [...] Evaluation is done on the first 175 examples (MFQA was excluded from training since it has only 150 samples).
Hardware Specification | Yes | On an A100 GPU, at 32× sparsity, incorporating HashAttention reduces attention latency by up to 4.3× in GPT-FAST and 2.54× in FlashDecode, and achieves up to 3.12× higher throughput for GPT-FAST.
Software Dependencies | No | The paper mentions GPT-FAST (GPTFast, 2024) and FlashDecode (Dao, 2023) as efficiency frameworks, but does not provide specific version numbers for these or any other software libraries or dependencies used in the experiments.
Experiment Setup | Yes | In these experiments, we use a 3-layer MLP (128x128-128x128-128x32) for the mappings learned in HashAttention. [...] We use binary cross-entropy loss in a multi-class setting to train our functions ϕ_kv and ϕ_q with a standard Adam optimizer. [...] We use the following formula to compute the class-1 weights, parameterized by α and β: class1-weight = α + β · context-length (4). α and β are hyperparameters that can be chosen.
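Although the paper gives no pseudocode, the 'general recipe of sparse attention' it describes is: a cheap SCORE function estimates each key's relevance to the query, and exact attention is then computed only over the top-k keys. A minimal NumPy sketch of that recipe, with illustrative names (HashAttention learns its SCORE function via hashing; here it is an arbitrary callable):

```python
import numpy as np

def topk_sparse_attention(q, K, V, score_fn, k):
    """One decode step of top-k sparse attention.

    score_fn cheaply estimates each key's relevance to the query q;
    exact softmax attention is then computed only over the k
    highest-scoring keys. This is only an illustrative sketch of the
    general recipe, not the paper's implementation.
    """
    scores = score_fn(q, K)               # relevance estimates, shape (n,)
    idx = np.argsort(-scores)[:k]         # indices of the top-k keys
    Ks, Vs = K[idx], V[idx]               # gather the sparse subset
    logits = Ks @ q / np.sqrt(q.shape[-1])
    w = np.exp(logits - logits.max())     # numerically stable softmax
    w /= w.sum()
    return w @ Vs                         # attention output over the subset
```

With `score_fn = lambda q, K: K @ q` and k equal to the full context length this reduces to exact attention; shrinking k trades accuracy for the latency savings reported in the quotes above.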
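The split protocol quoted under Dataset Splits (first 175 samples for evaluation, last 25 added to training) can be sketched as follows; the function and argument names are ours, not the paper's:

```python
def split_longbench_dataset(samples, n_eval=175, n_train_tail=25):
    """Illustrative split matching the quoted protocol: the first
    n_eval samples are held out for evaluation, and the last
    n_train_tail samples are added to the training set."""
    eval_set = samples[:n_eval]
    train_extra = samples[-n_train_tail:]
    return eval_set, train_extra
```

Note the quoted exception: MFQA, with only 150 samples, contributes no training tail and is evaluated in full.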
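The class-weight formula in Eq. (4) and the weighted binary cross-entropy objective quoted under Experiment Setup can be sketched in NumPy. The α and β defaults below are illustrative placeholders, not values from the paper:

```python
import numpy as np

def class1_weight(context_length, alpha, beta):
    # Eq. (4): class1-weight = alpha + beta * context-length
    # alpha and beta are hyperparameters chosen by the user.
    return alpha + beta * context_length

def weighted_bce(logits, targets, context_length, alpha=1.0, beta=0.01):
    """Binary cross-entropy with the positive (class-1) term up-weighted
    per Eq. (4). Longer contexts have proportionally fewer relevant
    tokens, so the positive-class weight grows with context length.
    Default alpha/beta are illustrative, not the paper's."""
    p = 1.0 / (1.0 + np.exp(-logits))           # sigmoid probabilities
    w1 = class1_weight(context_length, alpha, beta)
    loss = -(w1 * targets * np.log(p) + (1 - targets) * np.log(1 - p))
    return loss.mean()
```

In the quoted setup this loss would train the two 3-layer MLP mappings (128x128-128x128-128x32) with Adam; the sketch above covers only the objective.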