SpeCache: Speculative Key-Value Caching for Efficient Generation of LLMs

Authors: Shibo Jie, Yehui Tang, Kai Han, Zhi-Hong Deng, Jing Han

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on the LongBench and Needle-in-a-Haystack benchmarks verify that SPECACHE effectively reduces VRAM usage while avoiding information forgetting for long sequences, without re-training, even at a 10× KV cache compression ratio.
Researcher Affiliation | Collaboration | 1. State Key Laboratory of General Artificial Intelligence, School of Intelligence Science and Technology, Peking University; 2. Huawei Noah's Ark Lab; 3. School of Artificial Intelligence, Beijing University of Posts and Telecommunications. Correspondence to: Zhi-Hong Deng <EMAIL>, Jing Han <EMAIL>.
Pseudocode | Yes | Appendix A (Algorithm) states: "We provide the pseudocode for a single attention layer. For simplicity, we have omitted the residual and grouped quantization details of KIVI." Algorithm 1 covers prefilling, Algorithm 2 pre-decoding, and Algorithm 3 decoding.
Open Source Code | No | The paper mentions that "Huggingface's transformers (Wolf et al., 2019) library also implements a simple offloaded KV cache" and that the evaluation uses "Hugging Face transformers as framework". It also describes the implementation: "Note that our CPU-GPU interaction code is implemented using PyTorch's multi-stream mechanism and the Tensor.copy_() method, so the parallelism achieved is not theoretically optimal. By customizing lower-level operators, the efficiency of SPECACHE can be further improved." However, the paper does not state that the authors are releasing their own code for SPECACHE, nor does it provide a link to a code repository.
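Since no code is released, the CPU-GPU interaction the paper describes can only be approximated. Below is a minimal PyTorch sketch of prefetching selected 16-bit KV pairs from CPU memory to the GPU on a side stream, in the spirit of the quoted "multi-stream mechanism and the Tensor.copy_() method"; the function name, signature, and tensor layout are illustrative assumptions, not the authors' implementation.

```python
import torch

def prefetch_topk_kv(cpu_k, cpu_v, topk_idx, copy_stream=None):
    """Copy selected 16-bit KV pairs from CPU memory to the GPU.

    Hypothetical sketch: cpu_k / cpu_v are (num_tokens, head_dim)
    tensors resident in (ideally pinned) CPU memory, and topk_idx
    holds the indices of the KV pairs predicted to be needed next.
    """
    device = "cuda" if torch.cuda.is_available() else "cpu"
    # Gather the selected rows on the CPU side first.
    k_sel = cpu_k.index_select(0, topk_idx)
    v_sel = cpu_v.index_select(0, topk_idx)
    if copy_stream is not None and device == "cuda":
        # Issue the host-to-device copy on a side stream so it can
        # overlap with compute running on the default stream.
        with torch.cuda.stream(copy_stream):
            k_gpu = torch.empty_like(k_sel, device=device)
            v_gpu = torch.empty_like(v_sel, device=device)
            k_gpu.copy_(k_sel, non_blocking=True)
            v_gpu.copy_(v_sel, non_blocking=True)
        return k_gpu, v_gpu
    # CPU-only fallback: a plain synchronous transfer.
    return k_sel.to(device), v_sel.to(device)
```

As the paper notes, stream-based copies like this are not theoretically optimal; a fused lower-level gather-and-copy kernel would reduce the CPU-side gather overhead.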
Open Datasets | Yes | "We conducted experiments on various LLMs using the LongBench (Bai et al., 2024) and Needle-in-a-Haystack (Greg Kamradt, 2023) benchmarks." "We conducted experiments on the LLaMA-3-8B model using the PG19 dataset truncated to a sequence length of 8196."
Dataset Splits | No | The paper mentions using the PG19 dataset truncated to a sequence length of 8196 and evaluates on the LongBench and Needle-in-a-Haystack benchmarks. For Needle-in-a-Haystack, it describes how the "needle" sentence was inserted into Paul Graham's essays. However, it does not provide explicit training, validation, or test splits, percentages, or split methodology beyond using these datasets as benchmarks.
Hardware Specification | Yes | "Evaluated on a single NVIDIA A6000 GPU using Mistral-7B-Instruct-v0.2." "...maximize GPU memory usage up to the 48GB VRAM of an NVIDIA A6000 GPU."
Software Dependencies | No | The paper mentions using "Hugging Face transformers as framework" and describes CPU-GPU interaction code implemented using "PyTorch's multi-stream mechanism and the Tensor.copy_() method". However, it does not specify version numbers for Hugging Face transformers, PyTorch, or any other software components.
Experiment Setup | Yes | "Specifically, for the baseline KIVI method, we use 128 residual KV pairs and quantization group sizes of 32 and 64. For SPECACHE, we prefetch the top-64 16-bit KV pairs from CPU memory. Since these 16-bit KV pairs will be loaded into VRAM, in order to keep the total size of the KV cache unchanged, we use a smaller residual length of 64. We conduct an ablation study on the number of prefetched KV pairs, k. As shown in Figure 5, even with a small k, such as k = 16, SPECACHE still provides a significant improvement over the KIVI baseline (i.e., k = 0)."
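The setup above hinges on choosing which k KV pairs to prefetch. A minimal sketch of one plausible selection rule, scoring all offloaded keys against a (speculated) next-step query and taking the top k, is shown below; the function name and scoring scheme are assumptions for illustration, not the authors' exact method.

```python
import torch

def select_topk_kv(query, cpu_keys, k=64):
    """Return indices of the k KV pairs most relevant to `query`.

    Hypothetical sketch: query is a (head_dim,) vector, cpu_keys is a
    (num_tokens, head_dim) matrix of offloaded keys. k = 64 matches the
    paper's default; the ablation sweeps values down to k = 16.
    """
    # Approximate attention relevance via dot-product scores.
    scores = query @ cpu_keys.T
    k = min(k, scores.numel())
    return torch.topk(scores, k).indices
```

With k = 0 this degenerates to the KIVI baseline (nothing prefetched), which matches how the ablation in Figure 5 is framed.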