HShare: Fast LLM Decoding by Hierarchical Key-Value Sharing

Authors: Huaijin Wu, Lianqiang Li, Hantao Huang, Yi Tu, Jihang Zhang, Minghui Yu, Junchi Yan

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate the effectiveness and efficiency of HShare across various tasks using three models: LLaMA2-7b, LLaMA3-70b, and Mistral-7b. Experimental results demonstrate that HShare achieves competitive accuracy with different sharing ratios, while delivering up to an 8.6× speedup in self-attention operations and a 2.7× improvement in end-to-end throughput compared with FlashAttention2 and GPT-fast, respectively.
Researcher Affiliation | Collaboration | 1 Shanghai Jiao Tong University, Shanghai, China; 2 ByteDance Inc.
Pseudocode | Yes | Algorithm 1: Layer (Head) Sharing Algorithm
Open Source Code | Yes | The source code is publicly available at https://github.com/wuhuaijin/HShare.
Open Datasets | Yes | We evaluate HShare on GSM8K (Cobbe et al., 2021), COQA (Reddy et al., 2019), and sixteen English datasets in LongBench (Bai et al., 2023).
Dataset Splits | No | The paper mentions using the GSM8K, COQA, and LongBench datasets for evaluation. For GSM8K and COQA, it specifies 'zero-shot inference'. For LongBench, it evaluates on 'sixteen English datasets'. While it describes settings for critical token selection (e.g., Nc = 128 or 512), it does not provide explicit training/validation/test split percentages or sample counts for any of the datasets.
Hardware Specification | Yes | All experiments are conducted on a machine with a Xeon(R) Platinum 8336C CPU, one A100 GPU, and 128 GB RAM.
Software Dependencies | No | The paper mentions using PyTorch for attention approximation and OpenAI Triton for kernel design, and states that 'our implementation is based on GPT-fast (PyTorch, 2023)'. However, it does not provide specific version numbers for PyTorch, GPT-fast, or OpenAI Triton.
Experiment Setup | Yes | For a fair comparison, all methods select Nc = 128 critical KV-cache tokens and perform token-sparse attention, with token sparsity ratios of approximately 1/4 and 1/16 for the two datasets respectively (since GSM8K consists of math problems, we select a higher sparsity level, whereas COQA involves story-based question answering, so we choose a lower sparsity level). For HShare, we select critical tokens from three aspects, with x = 8 sink tokens, y = 32 recent tokens, and z = 88 critical tokens in the middle. We evaluate the effectiveness of our proposed HShare using LLaMA2-7b-chat, LLaMA3-70b, and Mistral-7b, selecting three sharing ratios of HShare, 7/8-3/4-1/2, 3/4-3/4-1/2, and 1/2-1/2-1/2, for comparison against other methods. ... All methods select Nc = 512 critical KV-cache tokens, with a token sparsity ratio of approximately 1/8. Similarly, we select critical tokens from three aspects: x = 16 sink tokens, y = 64 recent tokens, and z = 432 critical tokens in the middle. ... We conduct the self-attention latency evaluation on a single A100 GPU with batch sizes 8 and 16 and sequence lengths ranging from 1k to 4k, with a token sparsity ratio of 1/8.
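The three-part selection in the setup above (x sink tokens, y recent tokens, z top-scoring middle tokens, with Nc = x + y + z) can be sketched as follows. This is a minimal illustrative helper, not HShare's actual implementation; the function name, the use of NumPy, and the assumption that per-token importance scores are available as a 1-D array are all ours.

```python
import numpy as np

def select_critical_tokens(scores, x=8, y=32, z=88):
    """Pick Nc = x + y + z critical KV-cache token indices:
    the first x (sink) positions, the last y (recent) positions,
    and the z highest-scoring positions in the middle.

    `scores` is a hypothetical 1-D array of per-token importance
    (e.g. accumulated attention weight over the current queries).
    """
    n = len(scores)
    assert n >= x + y + z, "sequence shorter than Nc"
    sink = np.arange(x)                # first x positions
    recent = np.arange(n - y, n)       # last y positions
    middle = np.arange(x, n - y)       # candidate middle positions
    # top-z middle positions by importance score
    top = middle[np.argsort(scores[middle])[-z:]]
    return np.sort(np.concatenate([sink, top, recent]))

# Example: a 256-token sequence with the paper's Nc = 128 setting
idx = select_critical_tokens(np.random.rand(256), x=8, y=32, z=88)
```

With Nc = 512 the same sketch applies with x = 16, y = 64, z = 432, matching the LongBench setting quoted above.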