RazorAttention: Efficient KV Cache Compression Through Retrieval Heads
Authors: Hanlin Tang, Yang Lin, Jing Lin, Qingsen Han, Danning Ke, Shikuan Hong, Yiwu Yao, Gongyi Wang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive evaluations across a diverse set of large language models (LLMs) demonstrate that RazorAttention achieves a reduction in KV cache size of over 70% without noticeable impact on performance. Additionally, RazorAttention is compatible with FlashAttention, making it an efficient, plug-and-play solution that improves LLM inference efficiency without overhead or retraining of the original model. In Section 4, 'EXPERIMENTS', the paper details 'ACCURACY EVALUATION', 'LONGBENCH EVALUATION', 'NEEDLE IN A HAYSTACK EVALUATION', and 'ABLATION STUDIES', using models such as Llama2-13B-64K and GLM-9B-1M on benchmarks such as LongBench, Needle In A Haystack, RULER, and InfiniteBench, presenting results in tables and figures. |
| Researcher Affiliation | Industry | Hanlin Tang¹, Yang Lin¹, Jing Lin¹, Qingsen Han¹, Danning Ke¹, Shikuan Hong¹, Yiwu Yao¹, and Gongyi Wang¹ — ¹Huawei Technologies Co., Ltd. |
| Pseudocode | Yes | Algorithm 1: RazorAttention for RoPE Models. Input: non-retrieval head set {H}, original KV cache (after rotary transformation) {K, V}, compression ratio C, compression threshold S0, sink token number N0. 1: for each non-retrieval head h ∈ {H} do; 2: compute the buffer length Lh = max(S0, N/C), where N is the number of tokens in the head; 3: keep only the most recent Lh tokens near the output and the first N0 sink tokens, discarding the remaining tokens and compressing them into a compensation token according to Equation 4; 4: end for; 5: non-retrieval heads compute attention according to Equation 5, while retrieval heads follow the original attention. Output: generated output tokens. |
| Open Source Code | No | The paper does not provide a direct link to a source-code repository, an explicit statement of code release for the methodology described, or mention of code in supplementary materials. |
| Open Datasets | Yes | Figure 1: RazorAttention achieves comparable performance to the original model, even with 70% of the KV cache compressed. To demonstrate this, we tested Llama2-13B-64K (Fu et al., 2024) on the Needle In A Haystack benchmark (gkamradt, 2023). Figure 2: Importance-based token-dropping methods cannot work when querying information less relevant to the main theme. Here, we use an 8K document from LongBench (Bai et al., 2023b)... In Section 4, the paper states: 'The selected models are evaluated on Longbench (Bai et al., 2023b) and Needle In A Haystack (gkamradt, 2023) to demonstrate their capabilities in long-context circumstances.' Additional evaluations on the RULER (Hsieh et al., 2024) and InfiniteBench (Zhang et al., 2024) datasets are also mentioned. |
| Dataset Splits | No | The paper mentions using established benchmarks like LongBench, Needle in a Haystack, RULER, and Infinite Bench for evaluation, which typically have predefined evaluation protocols. However, it does not explicitly provide specific train/test/validation split percentages, sample counts, or describe a detailed splitting methodology defined by the authors for their experiments. |
| Hardware Specification | Yes | The experiments are conducted on an NVIDIA GeForce RTX 4090 (24GB). We evaluate the decoding latency and throughput of RazorAttention on the GLM-9B-1M model using 8 Ascend 910B NPUs, considering different input lengths for both prefill and decoding, as shown in Table 7. |
| Software Dependencies | No | The paper mentions compatibility with FlashAttention and uses various pre-trained LLMs such as Qwen, Llama2, Llama3, Baichuan, and GLM-9B-1M. However, it does not provide specific version numbers for any ancillary software libraries or frameworks (e.g., PyTorch version, Python version) used in its implementation or experimentation. |
| Experiment Setup | Yes | Table 2 (general hyper-parameter settings): buffer length max(4000, N/5); induction head protection: top 14%; echo head protection: top 1%; sink token num: 4. These settings lead to 3.125x compression of the KV cache under long-context input. Moreover, 'ϵ is set to 0.001; the contribution of the current token in the attention map is less than 0.001 under the ALiBi encoding,' and 'In our experimental setup, we retained 30% of retrieval heads (28% induction heads and 2% echo heads) and implemented a compression mechanism for non-retrieval heads when sequences exceeded 4k tokens: we preserved an attention window of size 4 and a local window covering 20% of the sequence length, with the remaining tokens being directly compressed.' |
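The per-head compression loop described in the Pseudocode row can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name is hypothetical, and the compensation-token rule shown here (averaging the discarded keys and values) is an assumption standing in for the paper's Equation 4, which the extracted text does not reproduce.

```python
import numpy as np

def compress_non_retrieval_head(K, V, C=5.0, S0=4000, N0=4):
    """Sketch of RazorAttention's KV cache compression for one
    non-retrieval head (Algorithm 1).
    K, V: (N, d) key/value caches after the rotary transformation.
    C: compression ratio, S0: compression threshold, N0: sink tokens."""
    N = K.shape[0]
    Lh = max(S0, int(N / C))  # buffer length Lh = max(S0, N/C)
    if N <= N0 + Lh:
        return K, V  # short sequence: nothing to discard
    # Keep the first N0 sink tokens and the most recent Lh tokens;
    # merge the discarded middle tokens into one compensation token
    # (mean of their keys/values -- an assumed stand-in for Eq. 4).
    mid_K, mid_V = K[N0:N - Lh], V[N0:N - Lh]
    comp_K = mid_K.mean(axis=0, keepdims=True)
    comp_V = mid_V.mean(axis=0, keepdims=True)
    K_new = np.concatenate([K[:N0], comp_K, K[N - Lh:]])
    V_new = np.concatenate([V[:N0], comp_V, V[N - Lh:]])
    return K_new, V_new
```

With the paper's reported settings (S0 = 4000, C = 5, i.e. a local window of N/5, and 4 sink tokens), a long-context cache shrinks to roughly N/5 + 5 entries per non-retrieval head, while retrieval heads keep their full cache.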