RazorAttention: Efficient KV Cache Compression Through Retrieval Heads
Authors: Hanlin Tang, Yang Lin, Jing Lin, Qingsen Han, Danning Ke, Shikuan Hong, Yiwu Yao, Gongyi Wang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive evaluations across a diverse set of large language models (LLMs) demonstrate that RazorAttention achieves a reduction in KV cache size of over 70% without noticeable impact on performance. Additionally, RazorAttention is compatible with FlashAttention, making it an efficient, plug-and-play solution that improves LLM inference efficiency without overhead or retraining of the original model. In Section 4, 'EXPERIMENTS', the paper details 'ACCURACY EVALUATION', 'LONGBENCH EVALUATION', 'NEEDLE IN A HAYSTACK EVALUATION', and 'ABLATION STUDIES', using models such as Llama2-13B-64K and GLM-9B-1M on benchmarks such as LongBench, Needle In A Haystack, RULER, and InfiniteBench, presenting results in tables and figures. |
| Researcher Affiliation | Industry | Hanlin Tang¹, Yang Lin¹, Jing Lin¹, Qingsen Han¹, Danning Ke¹, Shikuan Hong¹, Yiwu Yao¹, and Gongyi Wang¹ — ¹Huawei Technologies Co., Ltd. |
| Pseudocode | Yes | Algorithm 1: RazorAttention for RoPE Models. Input: non-retrieval head set {H}, original KV cache (after rotary transformation) {K, V}, compression ratio C, compression threshold S0, sink token number N0. 1: for each non-retrieval head h ∈ {H} do; 2: compute the buffer length Lh = max(S0, N/C), where N is the number of tokens in the head; 3: keep only the most recent Lh tokens near the output and the first N0 sink tokens, discarding the remaining tokens and compressing them into a compensation token according to Equation 4; 4: end for; 5: non-retrieval heads compute attention according to Equation 5, while retrieval heads follow the original attention. Output: generated output tokens. |
| Open Source Code | No | The paper does not provide a direct link to a source-code repository, an explicit statement of code release for the methodology described, or mention of code in supplementary materials. |
| Open Datasets | Yes | Figure 1: RazorAttention achieves comparable performance to the original model, even with 70% of the KV cache compressed. To demonstrate this, we tested Llama2-13B-64K (Fu et al., 2024) on the Needle In A Haystack benchmark (gkamradt, 2023). Figure 2: Importance-based token-dropping methods cannot work when querying information less relevant to the main theme. Here, we use an 8K document from LongBench (Bai et al., 2023b)... In Section 4, the paper states: 'The selected models are evaluated on Longbench (Bai et al., 2023b) and Needle In A Haystack (gkamradt, 2023) to demonstrate their capabilities in long-context circumstances.' Additional evaluations on the RULER (Hsieh et al., 2024) and InfiniteBench (Zhang et al., 2024) datasets are also mentioned. |
| Dataset Splits | No | The paper mentions using established benchmarks like LongBench, Needle in a Haystack, RULER, and Infinite Bench for evaluation, which typically have predefined evaluation protocols. However, it does not explicitly provide specific train/test/validation split percentages, sample counts, or describe a detailed splitting methodology defined by the authors for their experiments. |
| Hardware Specification | Yes | The experiments are conducted on an NVIDIA GeForce RTX 4090 (24GB). We evaluate the decoding latency and throughput of RazorAttention on the GLM-9B-1M model using 8 Ascend 910B NPUs, considering different input lengths for both prefill and decoding, as shown in Table 7. |
| Software Dependencies | No | The paper mentions compatibility with FlashAttention and uses various pre-trained LLMs such as Qwen, Llama2, Llama3, Baichuan, and GLM-9B-1M. However, it does not provide specific version numbers for any ancillary software libraries or frameworks (e.g., PyTorch version, Python version) used in its implementation or experimentation. |
| Experiment Setup | Yes | Table 2 (general hyper-parameter settings): buffer length max(4000, N/5); induction head protection: top 14%; echo head protection: top 1%; sink token num: 4. These settings lead to 3.125x compression of the KV cache under long-context input. Moreover, 'ϵ is set to 0.001; the contribution of the current token in the attention map is less than 0.001 under the ALiBi encoding,' and 'In our experimental setup, we retained 30% of retrieval heads (28% induction heads and 2% echo heads) and implemented a compression mechanism for non-retrieval heads when sequences exceeded 4k tokens: we preserved an attention window of size 4 and a local window covering 20% of the sequence length, with the remaining tokens being directly compressed.' |
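The per-head compression loop described in the Pseudocode row can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name is hypothetical, and the compensation-token rule shown here (averaging the discarded keys and values) is an assumption standing in for the paper's Equation 4, which the extracted text does not reproduce.

```python
import numpy as np

def compress_non_retrieval_head(K, V, C=5.0, S0=4000, N0=4):
    """Sketch of RazorAttention's KV cache compression for one
    non-retrieval head (Algorithm 1).
    K, V: (N, d) key/value caches after the rotary transformation.
    C: compression ratio, S0: compression threshold, N0: sink tokens."""
    N = K.shape[0]
    Lh = max(S0, int(N / C))  # buffer length Lh = max(S0, N/C)
    if N <= N0 + Lh:
        return K, V  # short sequence: nothing to discard
    # Keep the first N0 sink tokens and the most recent Lh tokens;
    # merge the discarded middle tokens into one compensation token
    # (mean of their keys/values -- an assumed stand-in for Eq. 4).
    mid_K, mid_V = K[N0:N - Lh], V[N0:N - Lh]
    comp_K = mid_K.mean(axis=0, keepdims=True)
    comp_V = mid_V.mean(axis=0, keepdims=True)
    K_new = np.concatenate([K[:N0], comp_K, K[N - Lh:]])
    V_new = np.concatenate([V[:N0], comp_V, V[N - Lh:]])
    return K_new, V_new
```

With the paper's reported settings (S0 = 4000, C = 5, i.e. a local window of N/5, and 4 sink tokens), a long-context cache shrinks to roughly N/5 + 5 entries per non-retrieval head, while retrieval heads keep their full cache.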