CateKV: On Sequential Consistency for Long-Context LLM Inference Acceleration
Authors: Haoyun Jiang, Haolin Li, Jianwei Zhang, Fei Huang, Qiang Hu, Minmin Sun, Shuai Xiao, Yong Li, Junyang Lin, Jiangchao Yao
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Comprehensive evaluations on long-context benchmarks show that, while maintaining accuracy comparable to full attention, CateKV reduces memory usage by up to 2.72×, accelerates decoding by 2.18× for single-sample inputs, and boosts throughput by 3.96× in batch scenarios. |
| Researcher Affiliation | Collaboration | CMIC, Shanghai Jiao Tong University; Alibaba Group; Fudan University. Correspondence to: Jiangchao Yao <EMAIL>, Shuai Xiao <EMAIL>. |
| Pseudocode | Yes | Algorithm 1: CateKV in an Individual Head |
| Open Source Code | No | The paper mentions using code from other works (e.g., DuoAttention) but does not provide an explicit statement or link for an open-source release of its own method's code. |
| Open Datasets | Yes | We conducted extensive experiments on widely used benchmarks including RULER (Hsieh et al., 2024), LongBench (Bai et al., 2023b), and NIAH (Kamradt, 2024), using models such as LLaMA-3-8B-Instruct-1048K (Gradient, 2024a), GLM-4-9B-1M (GLM et al., 2024a), LLaMA-3.1-8B (Meta AI, 2024) and Yi-9B-200K (AI et al., 2024) to demonstrate the effectiveness. |
| Dataset Splits | Yes | In this experiment, we test 12 synthetic tasks under a context of 128K, with each task including 96 samples. We built a reference set for head identification of CateKV by emulating the Variable Tracking task from RULER, which is very distinct from the test set. The reference dataset, based on the Variable Tracking task from the RULER benchmark, comprises 100 samples, each 128K in length, distinct from the test set. |
| Hardware Specification | Yes | The experiments were carried out on a single NVIDIA A100-80G GPU. As shown in Table 4, under the generic settings of r = 0.4 and η = 1.0, the Phi-3 model achieved reductions of 2.11× in memory and 1.79× in latency by using CateKV. By balancing efficiency and accuracy, CateKV further reduced memory usage by 2.72× and decoding latency by 2.18× on Llama-3, with accuracy decline on RULER-128K and LongBench tasks under 0.25%. |
| Software Dependencies | No | The paper mentions models and technologies like FlashAttention, but does not provide specific version numbers for software libraries or dependencies used in the implementation. |
| Experiment Setup | Yes | In CateKV, we set the adaptive ratio r and retention ratio η to 0.4 and 1.0 respectively. The budget for consistent heads is set to 512. In CateKV, we set the adaptive head ratio r to 0.4, the retention ratio η to 1.0, and allocate a sparse budget for consistent heads of 2048 (1.56%), retaining approximately 41% of the KV cache size. During the identification stage of CateKV, we employed an observation window and temporarily excluded initial tokens and recent tokens from the context window. We set L_obs to 64, while L_init and L_rec were defined as 1/32 and 1/128 of the sparse budget, respectively. ... The sparse budget was set at 2048. ... For the percentile threshold k, Llama3 and Llama3.1 were set at 0.996 and 0.984, respectively, while other models were set at 0.99. For the scaling factor α, Llama3.1 and Yi were set at 0.8, while other models were assigned a value of 1.0. |
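The reported hyperparameters imply a simple cache-budget arithmetic: a fraction r of heads are adaptive and keep a fraction η of their KV cache, while the remaining consistent heads keep only the fixed sparse budget. A minimal sketch (the function name and formula are our illustration, not the paper's code) reproduces the ~41% retained-cache figure from the quoted settings:

```python
def retained_kv_fraction(r: float, eta: float, sparse_budget: int, context_len: int) -> float:
    """Estimate the fraction of KV cache retained when a fraction r of heads
    are adaptive (each keeping a fraction eta of its cache) and the remaining
    (1 - r) consistent heads each keep only sparse_budget tokens.

    This is an illustrative back-of-the-envelope model, not the paper's code.
    """
    consistent_keep = sparse_budget / context_len  # per-head fraction kept by consistent heads
    return r * eta + (1.0 - r) * consistent_keep

# Settings quoted above: r = 0.4, eta = 1.0, budget = 2048 tokens, 128K context.
frac = retained_kv_fraction(r=0.4, eta=1.0, sparse_budget=2048, context_len=128 * 1024)
print(f"{frac:.1%}")  # ≈ 40.9%, consistent with the ~41% KV cache size reported
```

Note that 2048 / 131072 ≈ 1.56% also matches the sparse-budget percentage quoted in the setup, which supports reading the 41% figure as this weighted average over head types.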