ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference
Authors: Hanshi Sun, Li-Wen Chang, Wenlei Bao, Size Zheng, Ningxin Zheng, Xin Liu, Harry Dong, Yuejie Chi, Beidi Chen
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, we conduct extensive experiments and ablation studies to demonstrate the effectiveness and efficiency of SHADOWKV. In Section 5.1, we evaluate across various long-context LLMs, such as Llama-3-8B-1M (Gradient, 2024), Llama-3.1-8B (Meta AI, 2024), GLM-4-9B-1M (GLM et al., 2024), Yi-9B-200K (AI et al., 2024), Phi-3-Mini-128K (Abdin et al., 2024) and Qwen2-7B-128K (Yang et al., 2024a) using benchmarks including RULER (Hsieh et al., 2024), LongBench (Bai et al., 2023), and Needle In A Haystack (Kamradt, 2023) with contexts up to 1M. In Section 5.2, we demonstrate that SHADOWKV can support 6× larger batch sizes and boost throughput by up to 3.04× compared to small batches on an A100 using Llama-3.1-8B. We also present results across different models and context lengths, increasing throughput up to 2.97× for Llama-3-8B-1M, 2.56× for GLM-4-9B-1M, and 2.66× for Yi-9B-200K, even surpassing infinite batch size under the assumption of infinite GPU memory. The code is available at https://github.com/ByteDance-Seed/ShadowKV. |
| Researcher Affiliation | Collaboration | Hanshi Sun¹² Li-Wen Chang¹ Wenlei Bao¹ Size Zheng¹ Ningxin Zheng¹ Xin Liu¹ Harry Dong² Yuejie Chi² Beidi Chen² With the widespread deployment of long-context large language models (LLMs), there has been a growing demand for efficient support of high-throughput inference. However, as the key-value (KV) cache expands with the sequence length, the increasing memory footprint and the need to access it for decoding both result in low throughput when serving long-context LLMs. ¹ByteDance Seed, ²Carnegie Mellon University. Correspondence to: Hanshi Sun <EMAIL>, Beidi Chen <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 SHADOWKV Pre-filling. Input: pre-RoPE key cache K, post-RoPE key cache K_RoPE, value cache V ∈ ℝ^{b×h_kv×s×d}, SVD rank r, chunk size c, number of outlier chunks o. Store a low-rank projection of the pre-RoPE key cache: A ∈ ℝ^{b×s×r}, B ∈ ℝ^{b×h_kv×r×d} ← SVD(K). Segment the post-RoPE key cache into chunks and compute the mean of each chunk: C ∈ ℝ^{b×h_kv×(s/c)×d} ← Reduce(K_RoPE). Compute cosine similarity within each chunk: S ∈ ℝ^{b×h_kv×(s/c)×c} ← CosineSimilarity(C, K_RoPE). Find the chunks with the lowest cosine similarity as outliers: I ∈ ℝ^{b×h_kv×o} ← ArgTopK(−Min(S, dim=−1), o); K_outlier, V_outlier ← Gather(K_RoPE, V, I). Offload the remaining values to the CPU and store the non-outlier chunk means as landmarks: V_CPU ← V \ V_outlier; L ← C \ Gather(C, I). Algorithm 2 SHADOWKV Decoding. Input: A, B, L, V_CPU, query Q ∈ ℝ^{b×h_q×s_q×d}, K_outlier, V_outlier, current K, V ∈ ℝ^{b×h_kv×s_q×d}, number of chunks n_c, selected chunk budget k. Compute chunk attention scores: P ∈ ℝ^{b×h_q×s_q×n_c} ← MatMul(Q, Lᵀ); S ← Softmax(P/√d); S1 ∈ ℝ^{b×h_q×n_c} ← Sum(S, dim=2); S2 ∈ ℝ^{b×h_kv×n_c} ← Max_{kv group}(S1). Select the top-k chunks for each KV head: I ∈ ℝ^{b×h_kv×k} ← ArgTopK(S2, k). Gather the value cache from the CPU: V_sparse ← Gather(V_CPU, I); V ← [V_outlier; V_sparse; V]. Reconstruct the key cache from the low-rank projection: K_sparse ← MatMul(Gather(A, I), B); K ← [K_outlier; RoPE(K_sparse); K]. |
| Open Source Code | Yes | The code is available at https://github.com/ByteDance-Seed/ShadowKV. |
| Open Datasets | Yes | We evaluate our approach on three challenging long-context benchmarks: RULER (Hsieh et al., 2024), LongBench (Bai et al., 2023), and Needle In A Haystack (NIAH) (Kamradt, 2023), covering QA, multi-hop, reasoning, summarization, and code completion. |
| Dataset Splits | No | We only test on samples longer than 4K and set the sparse KV cache budget to 256 for this benchmark since it has shorter inputs compared to RULER. For the test with MInference (Jiang et al., 2024), we set up test sets scaling from 8K to 256K for evaluation. |
| Hardware Specification | Yes | By evaluating SHADOWKV on benchmarks like RULER, LongBench, and models such as Llama-3.1-8B and GLM-4-9B-1M, we demonstrate that it achieves up to 6× larger batch sizes and 3.04× higher throughput on an A100 GPU without sacrificing accuracy, even surpassing the performance achievable with infinite batch size under the assumption of infinite GPU memory. |
| Software Dependencies | No | We implement the framework based on PyTorch (Paszke et al., 2019; Wolf, 2019) and dedicated kernels (Thakkar et al., 2023). FlashAttention (Dao et al., 2022; Dao, 2023; Hong et al., 2023) is used for attention computation, and some efficient fused kernels in FlashInfer (Ye et al., 2024) and vLLM (Kwon et al., 2023) are used, including layer norm. |
| Experiment Setup | Yes | We set the chunk size to 8, the rank to 160, and the number of outliers to 48 for SHADOWKV. We set the sparse KV cache budget to 256 for this benchmark since it has shorter inputs compared to RULER. We set the sparse budget to 1.56% for SHADOWKV. |
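The pre-filling and decoding algorithms quoted in the Pseudocode row can be sketched end to end. Below is a minimal single-batch, single-KV-head NumPy sketch of both phases, assuming small illustrative shapes; the random stand-in for the post-RoPE keys and all helper logic are assumptions for illustration, not the paper's batched CUDA kernels.

```python
import numpy as np

rng = np.random.default_rng(0)
s, d = 64, 16             # sequence length, head dimension
c, r, o, k = 8, 8, 2, 3   # chunk size, SVD rank, outlier chunks, chunk budget
nc = s // c               # number of chunks

K_pre = rng.standard_normal((s, d))   # pre-RoPE key cache
K_rope = rng.standard_normal((s, d))  # post-RoPE key cache (random stand-in)
V = rng.standard_normal((s, d))       # value cache

# ---- Pre-filling (Algorithm 1) ----
# Low-rank factors of the pre-RoPE keys via truncated SVD: K_pre ~= A @ B.
U, sig, Vt = np.linalg.svd(K_pre, full_matrices=False)
A = U[:, :r] * sig[:r]    # (s, r)
B = Vt[:r]                # (r, d)

# Chunk the post-RoPE keys and take per-chunk means as landmark candidates.
chunks = K_rope.reshape(nc, c, d)
C = chunks.mean(axis=1)   # (nc, d)

# Cosine similarity of each key to its chunk mean; chunks with the lowest
# minimum similarity are outliers and stay on the GPU exactly.
cos = np.einsum('ncd,nd->nc', chunks, C)
cos /= np.linalg.norm(chunks, axis=-1) * np.linalg.norm(C, axis=-1, keepdims=True)
outlier_idx = np.argsort(cos.min(axis=1))[:o]
keep_idx = np.setdiff1d(np.arange(nc), outlier_idx)

K_outlier = chunks[outlier_idx].reshape(-1, d)
V_outlier = V.reshape(nc, c, d)[outlier_idx].reshape(-1, d)
V_cpu = V.reshape(nc, c, d)[keep_idx]   # offloaded to host in the real system
L = C[keep_idx]                          # landmarks for non-outlier chunks

# ---- Decoding (Algorithm 2) ----
q = rng.standard_normal((1, d))          # one decode-step query
p = q @ L.T / np.sqrt(d)
score = np.exp(p - p.max())
score /= score.sum()                     # softmax over landmark chunks
top = np.argsort(score[0])[::-1][:k]     # top-k chunks by attention score

V_sparse = V_cpu[top].reshape(-1, d)     # "gather value cache from the CPU"
# Reconstruct only the selected key chunks from the low-rank factors
# (the real system applies RoPE to K_sparse at this point).
A_chunks = A.reshape(nc, c, r)[keep_idx][top].reshape(-1, r)
K_sparse = A_chunks @ B
```

The sketch keeps the paper's division of labor: only the small factors A, B, the landmarks, and the outlier chunks occupy GPU memory, while full values live on the CPU and only the top-k chunks are fetched and reconstructed per decode step.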