HShare: Fast LLM Decoding by Hierarchical Key-Value Sharing
Authors: Huaijin Wu, Lianqiang Li, Hantao Huang, Yi Tu, Jihang Zhang, Minghui Yu, Junchi Yan
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate the effectiveness and efficiency of HShare across various tasks using three models: LLaMA2-7b, LLaMA3-70b, and Mistral-7b. Experimental results demonstrate that HShare achieves competitive accuracy with different sharing ratios, while delivering up to an 8.6× speedup in self-attention operations and a 2.7× improvement in end-to-end throughput compared with FlashAttention2 and GPT-fast respectively. |
| Researcher Affiliation | Collaboration | 1 Shanghai Jiao Tong University, Shanghai, China 2 ByteDance Inc. |
| Pseudocode | Yes | Algorithm 1 Layer(Head) Sharing Algorithm |
| Open Source Code | Yes | The source code is publicly available at https://github.com/wuhuaijin/HShare. |
| Open Datasets | Yes | We evaluate HShare on GSM8K (Cobbe et al., 2021), COQA (Reddy et al., 2019), and sixteen English datasets in LongBench (Bai et al., 2023). |
| Dataset Splits | No | The paper mentions using GSM8K, COQA, and LongBench datasets for evaluation. For GSM8K and COQA, it specifies 'zero-shot inference'. For LongBench, it evaluates on 'sixteen English datasets'. While it describes settings for critical token selection (e.g., Nc = 128 or 512), it does not provide explicit training/validation/test split percentages or sample counts for any of the datasets. |
| Hardware Specification | Yes | All experiments are conducted on a machine with Xeon(R) Platinum 8336C CPU, one A100 GPU, and 128G RAM. |
| Software Dependencies | No | The paper mentions using PyTorch for attention approximation and OpenAI Triton for kernel design, and states that 'our implementation is based on GPT-fast (PyTorch, 2023)'. However, it does not provide specific version numbers for PyTorch, GPT-fast, or OpenAI Triton. |
| Experiment Setup | Yes | For a fair comparison, all methods select Nc = 128 critical KV cache tokens and do token sparse attention, with a token sparsity ratio of approximately 1/4 and 1/16 for the two datasets respectively (since GSM8K consists of math problems, we select a higher sparsity level, whereas COQA involves story-based question-answering, so we choose a lower sparsity level). For HShare, we select critical tokens from three aspects, with x = 8 sink tokens, y = 32 recent tokens, and z = 88 critical tokens in the middle. We evaluate the effectiveness of our proposed HShare using LLaMA2-7b-chat, LLaMA3-70b, and Mistral-7b, selecting three sharing ratios 7/8-3/4-1/2, 3/4-3/4-1/2 and 1/2-1/2-1/2 of HShare for comparison against other methods. ... All methods select Nc = 512 critical KV cache tokens, with a token sparsity ratio of approximately 1/8. Similarly, we select critical tokens from three aspects: x = 16 sink tokens, y = 64 recent tokens, and z = 432 critical tokens in the middle. ... We conduct the self-attention latency evaluation on a single A100 GPU with batch sizes 8 and 16 and sequence lengths ranging from 1k to 4k, with a token sparsity ratio of 1/8. |
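The quoted setup selects Nc critical KV-cache tokens from three regions: x attention-sink tokens at the start, y most-recent tokens at the end, and the top-z middle tokens. A minimal sketch of that selection rule, assuming per-token attention scores are available as a 1-D array (the function name and scoring input are illustrative, not the paper's actual implementation):

```python
import numpy as np

def select_critical_tokens(scores, x=8, y=32, z=88):
    """Pick critical KV-cache token indices from three regions:
    - the first x "sink" tokens,
    - the last y "recent" tokens,
    - the top-z tokens in the middle, ranked by attention score.
    `scores` is a 1-D array of per-token (approximate) attention scores.
    """
    n = len(scores)
    sink = set(range(x))                     # attention-sink tokens
    recent = set(range(n - y, n))            # most recent tokens
    middle = np.arange(x, n - y)             # candidate middle region
    # top-z middle tokens by score (argsort is over the middle slice,
    # so positions map back through `middle`)
    top_z = middle[np.argsort(scores[x:n - y])[-z:]]
    return sorted(sink | recent | set(top_z.tolist()))

# With x=8, y=32, z=88 and a sequence longer than x+y+z,
# the three regions are disjoint and this yields Nc = 128 tokens,
# matching the GSM8K/COQA setting quoted above.
idx = select_critical_tokens(np.random.rand(1024))
```

The LongBench setting (x = 16, y = 64, z = 432) follows the same rule with Nc = 512.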