CateKV: On Sequential Consistency for Long-Context LLM Inference Acceleration
Authors: Haoyun Jiang, Haolin Li, Jianwei Zhang, Fei Huang, Qiang Hu, Minmin Sun, Shuai Xiao, Yong Li, Junyang Lin, Jiangchao Yao
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Comprehensive evaluations on long-context benchmarks show that, while maintaining accuracy comparable to full attention, CateKV reduces memory usage by up to 2.72×, accelerates decoding by 2.18× for single-sample inputs, and boosts throughput by 3.96× in batch scenarios. |
| Researcher Affiliation | Collaboration | CMIC, Shanghai Jiao Tong University; Alibaba Group; Fudan University. Correspondence to: Jiangchao Yao <EMAIL>, Shuai Xiao <EMAIL>. |
| Pseudocode | Yes | Algorithm 1: CateKV in an Individual Head |
| Open Source Code | No | The paper mentions using code from other works (e.g., DuoAttention) but does not provide an explicit statement or link for an open-source release of its own method's code. |
| Open Datasets | Yes | We conducted extensive experiments on widely used benchmarks including RULER (Hsieh et al., 2024), LongBench (Bai et al., 2023b), and NIAH (Kamradt, 2024), using models such as LLaMA-3-8B-Instruct-1048K (Gradient, 2024a), GLM-4-9B-1M (GLM et al., 2024a), LLaMA-3.1-8B (Meta AI, 2024) and Yi-9B-200K (AI et al., 2024) to demonstrate the effectiveness. |
| Dataset Splits | Yes | In this experiment, we test 12 synthetic tasks under a context of 128K, with each task including 96 samples. We built a reference set for head identification of CateKV by emulating the Variable Tracking task from RULER, which is very distinct from the test set. The reference dataset, based on the Variable Tracking task from the RULER benchmark, comprises 100 samples, each 128K in length, distinct from the test set. |
| Hardware Specification | Yes | The experiments were carried out on a single NVIDIA A100-80G GPU. As shown in Table 4, under the generic settings of r = 0.4 and η = 1.0, the Phi-3 model achieved reductions of 2.11× in memory and 1.79× in latency by using CateKV. By balancing efficiency and accuracy, CateKV further reduced memory usage by 2.72× and decoding latency by 2.18× on Llama-3, with accuracy decline on RULER-128K and LongBench tasks under 0.25%. |
| Software Dependencies | No | The paper mentions models and technologies like FlashAttention, but does not provide specific version numbers for software libraries or dependencies used in the implementation. |
| Experiment Setup | Yes | In CateKV, we set the adaptive ratio r and retention ratio η to 0.4 and 1.0 respectively. The budget for consistent heads is set to 512. In CateKV, we set the adaptive head ratio r to 0.4, the retention ratio η to 1.0, and allocate a sparse budget for consistent heads of 2048 (1.56%), retaining approximately 41% of the KV cache size. During the identification stage of CateKV, we employed an observation window and temporarily excluded initial tokens and recent tokens from the context window. We set L_obs to 64, while L_init and L_rec were defined as 1/32 and 1/128 of the sparse budget, respectively. ... The sparse budget was set at 2048. ... For the percentile threshold k, Llama3 and Llama3.1 were set at 0.996 and 0.984, respectively, while other models were set at 0.99. For the scaling factor α, Llama3.1 and Yi were set at 0.8, while other models were assigned a value of 1.0. |
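The reported hyperparameters imply a simple cache-budget arithmetic: a fraction r of heads are adaptive and keep a fraction η of their KV cache, while the remaining consistent heads keep only the fixed sparse budget. A minimal sketch (the function name and formula are our illustration, not the paper's code) reproduces the ~41% retained-cache figure from the quoted settings:

```python
def retained_kv_fraction(r: float, eta: float, sparse_budget: int, context_len: int) -> float:
    """Estimate the fraction of KV cache retained when a fraction r of heads
    are adaptive (each keeping a fraction eta of its cache) and the remaining
    (1 - r) consistent heads each keep only sparse_budget tokens.

    This is an illustrative back-of-the-envelope model, not the paper's code.
    """
    consistent_keep = sparse_budget / context_len  # per-head fraction kept by consistent heads
    return r * eta + (1.0 - r) * consistent_keep

# Settings quoted above: r = 0.4, eta = 1.0, budget = 2048 tokens, 128K context.
frac = retained_kv_fraction(r=0.4, eta=1.0, sparse_budget=2048, context_len=128 * 1024)
print(f"{frac:.1%}")  # ≈ 40.9%, consistent with the ~41% KV cache size reported
```

Note that 2048 / 131072 ≈ 1.56% also matches the sparse-budget percentage quoted in the setup, which supports reading the 41% figure as this weighted average over head types.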