ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference
Authors: Hanshi Sun, Li-Wen Chang, Wenlei Bao, Size Zheng, Ningxin Zheng, Xin Liu, Harry Dong, Yuejie Chi, Beidi Chen
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, we conduct extensive experiments and ablation studies to demonstrate the effectiveness and efficiency of SHADOWKV. In Section 5.1, we evaluate across various long-context LLMs, such as Llama-3-8B-1M (Gradient, 2024), Llama-3.1-8B (Meta AI, 2024), GLM-4-9B-1M (GLM et al., 2024), Yi-9B-200K (AI et al., 2024), Phi-3-Mini-128K (Abdin et al., 2024) and Qwen2-7B-128K (Yang et al., 2024a) using benchmarks including RULER (Hsieh et al., 2024), LongBench (Bai et al., 2023), and Needle In A Haystack (Kamradt, 2023) with contexts up to 1M. In Section 5.2, we demonstrate that SHADOWKV can support 6× larger batch sizes and boost throughput by up to 3.04× compared to small batches on an A100 using Llama-3.1-8B. We also present results across different models and context lengths, increasing throughput up to 2.97× for Llama-3-8B-1M, 2.56× for GLM-4-9B-1M, and 2.66× for Yi-9B-200K, even surpassing infinite batch size under the assumption of infinite GPU memory. The code is available at https://github.com/ByteDance-Seed/ShadowKV. |
| Researcher Affiliation | Collaboration | Hanshi Sun¹² Li-Wen Chang¹ Wenlei Bao¹ Size Zheng¹ Ningxin Zheng¹ Xin Liu¹ Harry Dong² Yuejie Chi² Beidi Chen² With the widespread deployment of long-context large language models (LLMs), there has been a growing demand for efficient support of high-throughput inference. However, as the key-value (KV) cache expands with the sequence length, the increasing memory footprint and the need to access it for decoding both result in low throughput when serving long-context LLMs. ¹ByteDance Seed, ²Carnegie Mellon University. Correspondence to: Hanshi Sun <EMAIL>, Beidi Chen <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 SHADOWKV Pre-filling. Input: pre-RoPE key cache K, post-RoPE key cache K_RoPE, value cache V ∈ ℝ^{b×h_kv×s×d}, SVD rank r, chunk size c, number of outlier chunks o. Store a low-rank projection of the pre-RoPE key cache: A ∈ ℝ^{b×s×r}, B ∈ ℝ^{b×h_kv×r×d} ← SVD(K). Segment the post-RoPE key cache into chunks and compute the mean of each chunk: C ∈ ℝ^{b×h_kv×(s/c)×d} ← Reduce(K_RoPE). Compute cosine similarity within each chunk: S ∈ ℝ^{b×h_kv×(s/c)×c} ← CosineSimilarity(C, K_RoPE). Find the chunks with the lowest cosine similarity as outliers: I ∈ ℝ^{b×h_kv×o} ← ArgTopK(−Min(S, dim=−1), o); K_outlier, V_outlier ← Gather(K_RoPE, V, I). Offload the remaining values to the CPU and store the non-outlier chunk means as landmarks: V_CPU ← V \ V_outlier; L ← C \ Gather(C, I). Algorithm 2 SHADOWKV Decoding. Input: A, B, L, V_CPU, query Q ∈ ℝ^{b×h_q×s_q×d}, K_outlier, V_outlier, current K, V ∈ ℝ^{b×h_kv×s_q×d}, number of chunks n_c, selected chunk budget k. Compute chunk attention scores: P ∈ ℝ^{b×h_q×s_q×n_c} ← MatMul(Q, Lᵀ); S ← Softmax(P/√d); S1 ∈ ℝ^{b×h_q×n_c} ← Sum(S, dim=2); S2 ∈ ℝ^{b×h_kv×n_c} ← Max_{kv group}(S1). Select the top-k chunks for each KV head: I ∈ ℝ^{b×h_kv×k} ← ArgTopK(S2, k). Gather the value cache from the CPU: V_sparse ← Gather(V_CPU, I); V ← [V_outlier; V_sparse; V]. Reconstruct the key cache from the low-rank projection: K_sparse ← MatMul(Gather(A, I), B); K ← [K_outlier; RoPE(K_sparse); K]. |
| Open Source Code | Yes | The code is available at https://github.com/ByteDance-Seed/ShadowKV. |
| Open Datasets | Yes | We evaluate our approach on three challenging long-context benchmarks: RULER (Hsieh et al., 2024), LongBench (Bai et al., 2023), and Needle In A Haystack (NIAH) (Kamradt, 2023), covering QA, multi-hop, reasoning, summarization, and code completion. |
| Dataset Splits | No | We only test on samples longer than 4K and set the sparse KV cache budget to 256 for this benchmark since it has shorter inputs compared to RULER. For the test with MInference (Jiang et al., 2024), we set up test sets scaling from 8K to 256K for evaluation. |
| Hardware Specification | Yes | By evaluating SHADOWKV on benchmarks like RULER, LongBench, and models such as Llama-3.1-8B and GLM-4-9B-1M, we demonstrate that it achieves up to 6× larger batch sizes and 3.04× higher throughput on an A100 GPU without sacrificing accuracy, even surpassing the performance achievable with infinite batch size under the assumption of infinite GPU memory. |
| Software Dependencies | No | We implement the framework based on PyTorch (Paszke et al., 2019; Wolf, 2019) and dedicated kernels (Thakkar et al., 2023). FlashAttention (Dao et al., 2022; Dao, 2023; Hong et al., 2023) is used for attention computation, and some efficient fused kernels in FlashInfer (Ye et al., 2024) and vLLM (Kwon et al., 2023) are used, including layer norm. |
| Experiment Setup | Yes | We set the chunk size to 8, the rank to 160, and the number of outliers to 48 for SHADOWKV. We set the sparse KV cache budget to 256 for this benchmark since it has shorter inputs compared to RULER. We set the sparse budget to 1.56% for SHADOWKV. |
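The pre-filling and decoding algorithms quoted in the Pseudocode row can be sketched end to end. Below is a minimal single-batch, single-KV-head NumPy sketch of both phases, assuming small illustrative shapes; the random stand-in for the post-RoPE keys and all helper logic are assumptions for illustration, not the paper's batched CUDA kernels.

```python
import numpy as np

rng = np.random.default_rng(0)
s, d = 64, 16             # sequence length, head dimension
c, r, o, k = 8, 8, 2, 3   # chunk size, SVD rank, outlier chunks, chunk budget
nc = s // c               # number of chunks

K_pre = rng.standard_normal((s, d))   # pre-RoPE key cache
K_rope = rng.standard_normal((s, d))  # post-RoPE key cache (random stand-in)
V = rng.standard_normal((s, d))       # value cache

# ---- Pre-filling (Algorithm 1) ----
# Low-rank factors of the pre-RoPE keys via truncated SVD: K_pre ~= A @ B.
U, sig, Vt = np.linalg.svd(K_pre, full_matrices=False)
A = U[:, :r] * sig[:r]    # (s, r)
B = Vt[:r]                # (r, d)

# Chunk the post-RoPE keys and take per-chunk means as landmark candidates.
chunks = K_rope.reshape(nc, c, d)
C = chunks.mean(axis=1)   # (nc, d)

# Cosine similarity of each key to its chunk mean; chunks with the lowest
# minimum similarity are outliers and stay on the GPU exactly.
cos = np.einsum('ncd,nd->nc', chunks, C)
cos /= np.linalg.norm(chunks, axis=-1) * np.linalg.norm(C, axis=-1, keepdims=True)
outlier_idx = np.argsort(cos.min(axis=1))[:o]
keep_idx = np.setdiff1d(np.arange(nc), outlier_idx)

K_outlier = chunks[outlier_idx].reshape(-1, d)
V_outlier = V.reshape(nc, c, d)[outlier_idx].reshape(-1, d)
V_cpu = V.reshape(nc, c, d)[keep_idx]   # offloaded to host in the real system
L = C[keep_idx]                          # landmarks for non-outlier chunks

# ---- Decoding (Algorithm 2) ----
q = rng.standard_normal((1, d))          # one decode-step query
p = q @ L.T / np.sqrt(d)
score = np.exp(p - p.max())
score /= score.sum()                     # softmax over landmark chunks
top = np.argsort(score[0])[::-1][:k]     # top-k chunks by attention score

V_sparse = V_cpu[top].reshape(-1, d)     # "gather value cache from the CPU"
# Reconstruct only the selected key chunks from the low-rank factors
# (the real system applies RoPE to K_sparse at this point).
A_chunks = A.reshape(nc, c, r)[keep_idx][top].reshape(-1, r)
K_sparse = A_chunks @ B
```

The sketch keeps the paper's division of labor: only the small factors A, B, the landmarks, and the outlier chunks occupy GPU memory, while full values live on the CPU and only the top-k chunks are fetched and reconstructed per decode step.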