HShare: Fast LLM Decoding by Hierarchical Key-Value Sharing
Authors: Huaijin Wu, Lianqiang Li, Hantao Huang, Yi Tu, Jihang Zhang, Minghui Yu, Junchi Yan
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate the effectiveness and efficiency of HShare across various tasks using three models: LLaMA2-7b, LLaMA3-70b, and Mistral-7b. Experimental results demonstrate that HShare achieves competitive accuracy with different sharing ratios, while delivering up to an 8.6× speedup in self-attention operations and a 2.7× improvement in end-to-end throughput compared with FlashAttention2 and GPT-fast respectively. |
| Researcher Affiliation | Collaboration | 1 Shanghai Jiao Tong University, Shanghai, China 2 ByteDance Inc. |
| Pseudocode | Yes | Algorithm 1 Layer(Head) Sharing Algorithm |
| Open Source Code | Yes | The source code is publicly available at https://github.com/wuhuaijin/HShare. |
| Open Datasets | Yes | We evaluate HShare on GSM8K (Cobbe et al., 2021), COQA (Reddy et al., 2019), and sixteen English datasets in LongBench (Bai et al., 2023). |
| Dataset Splits | No | The paper mentions using GSM8K, COQA, and LongBench datasets for evaluation. For GSM8K and COQA, it specifies 'zero-shot inference'. For LongBench, it evaluates on 'sixteen English datasets'. While it describes settings for critical token selection (e.g., Nc = 128 or 512), it does not provide explicit training/validation/test split percentages or sample counts for any of the datasets. |
| Hardware Specification | Yes | All experiments are conducted on a machine with Xeon(R) Platinum 8336C CPU, one A100 GPU, and 128G RAM. |
| Software Dependencies | No | The paper mentions using PyTorch for attention approximation and OpenAI Triton for kernel design, and states that 'our implementation is based on GPT-fast (PyTorch, 2023)'. However, it does not provide specific version numbers for PyTorch, GPT-fast, or OpenAI Triton. |
| Experiment Setup | Yes | For a fair comparison, all methods select Nc = 128 critical KV cache tokens and do token sparse attention, with a token sparsity ratio of approximately 1/4 and 1/16 for the two datasets respectively (since GSM8K consists of math problems, we select a higher sparsity level, whereas COQA involves story-based question-answering, so we choose a lower sparsity level). For HShare, we select critical tokens from three aspects, with x = 8 sink tokens, y = 32 recent tokens, and z = 88 critical tokens in the middle. We evaluate the effectiveness of our proposed HShare using LLaMA2-7b-chat, LLaMA3-70b, and Mistral-7b, selecting three sharing ratios 7/8-3/4-1/2, 3/4-3/4-1/2 and 1/2-1/2-1/2 of HShare for comparison against other methods. ... All methods select Nc = 512 critical KV cache tokens, with a token sparsity ratio of approximately 1/8. Similarly, we select critical tokens from three aspects: x = 16 sink tokens, y = 64 recent tokens, and z = 432 critical tokens in the middle. ... We conduct the self-attention latency evaluation on a single A100 GPU with batch sizes 8 and 16 and sequence lengths ranging from 1k to 4k, with a token sparsity ratio of 1/8. |
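The quoted setup selects Nc critical KV-cache tokens from three regions: x attention-sink tokens at the start, y most-recent tokens at the end, and the top-z middle tokens. A minimal sketch of that selection rule, assuming per-token attention scores are available as a 1-D array (the function name and scoring input are illustrative, not the paper's actual implementation):

```python
import numpy as np

def select_critical_tokens(scores, x=8, y=32, z=88):
    """Pick critical KV-cache token indices from three regions:
    - the first x "sink" tokens,
    - the last y "recent" tokens,
    - the top-z tokens in the middle, ranked by attention score.
    `scores` is a 1-D array of per-token (approximate) attention scores.
    """
    n = len(scores)
    sink = set(range(x))                     # attention-sink tokens
    recent = set(range(n - y, n))            # most recent tokens
    middle = np.arange(x, n - y)             # candidate middle region
    # top-z middle tokens by score (argsort is over the middle slice,
    # so positions map back through `middle`)
    top_z = middle[np.argsort(scores[x:n - y])[-z:]]
    return sorted(sink | recent | set(top_z.tolist()))

# With x=8, y=32, z=88 and a sequence longer than x+y+z,
# the three regions are disjoint and this yields Nc = 128 tokens,
# matching the GSM8K/COQA setting quoted above.
idx = select_critical_tokens(np.random.rand(1024))
```

The LongBench setting (x = 16, y = 64, z = 432) follows the same rule with Nc = 512.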