SpeCache: Speculative Key-Value Caching for Efficient Generation of LLMs

Authors: Shibo Jie, Yehui Tang, Kai Han, Zhi-Hong Deng, Jing Han

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on the LongBench and Needle-in-a-Haystack benchmarks verify that SPECACHE effectively reduces VRAM usage while avoiding information forgetting for long sequences, without re-training, even at a 10× KV cache compression ratio.
Researcher Affiliation | Collaboration | 1. State Key Laboratory of General Artificial Intelligence, School of Intelligence Science and Technology, Peking University; 2. Huawei Noah's Ark Lab; 3. School of Artificial Intelligence, Beijing University of Posts and Telecommunications. Correspondence to: Zhi-Hong Deng <EMAIL>, Jing Han <EMAIL>.
Pseudocode | Yes | Appendix A (Algorithm) states: "We provide the pseudocode for a single attention layer. For simplicity, we have omitted the residual and grouped quantization details of KIVI." Algorithm 1 covers prefilling, Algorithm 2 pre-decoding, and Algorithm 3 decoding.
Open Source Code | No | The paper mentions that "Huggingface's transformers (Wolf et al., 2019) library also implements a simple offloaded KV cache" and that the evaluation uses "Hugging Face transformers as framework". It also describes the implementation: "Note that our CPU-GPU interaction code is implemented using PyTorch's multi-stream mechanism and the Tensor.copy_() method, so the parallelism achieved is not theoretically optimal. By customizing lower-level operators, the efficiency of SPECACHE can be further improved." However, the paper does not state that the authors are releasing their own code for SPECACHE, nor does it provide a link to a code repository.
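Since no code is released, the CPU-GPU interaction the paper describes can only be approximated. Below is a minimal PyTorch sketch of prefetching selected 16-bit KV pairs from CPU memory to the GPU on a side stream, in the spirit of the quoted "multi-stream mechanism and the Tensor.copy_() method"; the function name, signature, and tensor layout are illustrative assumptions, not the authors' implementation.

```python
import torch

def prefetch_topk_kv(cpu_k, cpu_v, topk_idx, copy_stream=None):
    """Copy selected 16-bit KV pairs from CPU memory to the GPU.

    Hypothetical sketch: cpu_k / cpu_v are (num_tokens, head_dim)
    tensors resident in (ideally pinned) CPU memory, and topk_idx
    holds the indices of the KV pairs predicted to be needed next.
    """
    device = "cuda" if torch.cuda.is_available() else "cpu"
    # Gather the selected rows on the CPU side first.
    k_sel = cpu_k.index_select(0, topk_idx)
    v_sel = cpu_v.index_select(0, topk_idx)
    if copy_stream is not None and device == "cuda":
        # Issue the host-to-device copy on a side stream so it can
        # overlap with compute running on the default stream.
        with torch.cuda.stream(copy_stream):
            k_gpu = torch.empty_like(k_sel, device=device)
            v_gpu = torch.empty_like(v_sel, device=device)
            k_gpu.copy_(k_sel, non_blocking=True)
            v_gpu.copy_(v_sel, non_blocking=True)
        return k_gpu, v_gpu
    # CPU-only fallback: a plain synchronous transfer.
    return k_sel.to(device), v_sel.to(device)
```

As the paper notes, stream-based copies like this are not theoretically optimal; a fused lower-level gather-and-copy kernel would reduce the CPU-side gather overhead.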
Open Datasets | Yes | "We conducted experiments on various LLMs using the LongBench (Bai et al., 2024) and Needle-in-a-Haystack (Greg Kamradt, 2023) benchmarks." "We conducted experiments on the LLaMA-3-8B model using the PG19 dataset truncated to a sequence length of 8196."
Dataset Splits | No | The paper mentions using the PG19 dataset truncated to a sequence length of 8196 and evaluates on the LongBench and Needle-in-a-Haystack benchmarks. For Needle-in-a-Haystack, it describes how the "needle" sentence was inserted into Paul Graham's essays. However, it does not provide explicit training, validation, or test splits, percentages, or split methodology beyond using these datasets as benchmarks.
Hardware Specification | Yes | "Evaluated on a single NVIDIA A6000 GPU using Mistral-7B-Instruct-v0.2." "...maximize GPU memory usage up to the 48GB VRAM of an NVIDIA A6000 GPU."
Software Dependencies | No | The paper mentions using "Hugging Face transformers as framework" and describes CPU-GPU interaction code implemented using "PyTorch's multi-stream mechanism and the Tensor.copy_() method". However, it does not specify version numbers for Hugging Face transformers, PyTorch, or any other software components.
Experiment Setup | Yes | "Specifically, for the baseline KIVI method, we use 128 residual KV pairs and quantization group sizes of 32 and 64. For SPECACHE, we prefetch the top-64 16-bit KV pairs from CPU memory. Since these 16-bit KV pairs will be loaded into VRAM, in order to keep the total size of the KV cache unchanged, we use a smaller residual length of 64. We conduct an ablation study on the number of prefetched KV pairs, k. As shown in Figure 5, even with a small k, such as k = 16, SPECACHE still provides a significant improvement over the KIVI baseline (i.e., k = 0)."
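The setup above hinges on choosing which k KV pairs to prefetch. A minimal sketch of one plausible selection rule, scoring all offloaded keys against a (speculated) next-step query and taking the top k, is shown below; the function name and scoring scheme are assumptions for illustration, not the authors' exact method.

```python
import torch

def select_topk_kv(query, cpu_keys, k=64):
    """Return indices of the k KV pairs most relevant to `query`.

    Hypothetical sketch: query is a (head_dim,) vector, cpu_keys is a
    (num_tokens, head_dim) matrix of offloaded keys. k = 64 matches the
    paper's default; the ablation sweeps values down to k = 16.
    """
    # Approximate attention relevance via dot-product scores.
    scores = query @ cpu_keys.T
    k = min(k, scores.numel())
    return torch.topk(scores, k).indices
```

With k = 0 this degenerates to the KIVI baseline (nothing prefetched), which matches how the ablation in Figure 5 is framed.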