A Survey on Large Language Model Acceleration based on KV Cache Management

Authors: Haoyang Li, Yiming Li, Anxin Tian, Tianhao Tang, Zhanchao Xu, Xuejia Chen, Nicole Hu, Wei Dong, Qing Li, Lei Chen

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Theoretical | This survey provides a comprehensive overview of KV cache management strategies for LLM acceleration, categorizing them into token-level, model-level, and system-level optimizations. By presenting detailed taxonomies and comparative analyses, it offers insights for researchers and practitioners and supports the development of efficient, scalable KV cache management techniques for the practical deployment of LLMs in real-world applications. The survey also overviews the text and multi-modal datasets and benchmarks used to evaluate these strategies.
Researcher Affiliation | Academia | 1 The Hong Kong Polytechnic University; 2 The Hong Kong University of Science and Technology; 3 Huazhong University of Science and Technology; 4 The Chinese University of Hong Kong; 5 Nanyang Technological University. Emails: EMAIL EMAIL EMAIL EMAIL EMAIL EMAIL EMAIL EMAIL EMAIL
Pseudocode | No | The paper contains mathematical equations and descriptive text for algorithms, but no clearly labeled 'Pseudocode' or 'Algorithm' blocks with structured, step-by-step instructions in a code-like format.
Open Source Code | No | The curated paper list for KV cache management is at https://github.com/TreeAI-Lab/Awesome-KV-Cache-Management. This link points to a curated paper list, not to open-source code implementing a methodology described in the survey; as a survey, the paper reviews existing work and does not propose new computational methods requiring a code release.
Open Datasets | Yes | The survey collects a wide range of long-context datasets, such as NumericBench (Li et al., 2025) and LongBench (Bai et al., 2023), and categorizes them into tasks including question answering, text summarization, text reasoning, text retrieval, text generation, and aggregation. LLaVA-Bench (Liu et al., 2023b) is structured around image, ground-truth textual description, and question-answer triplets, segmented across the COCO and In-The-Wild datasets.
Dataset Splits | No | This paper is a survey that reviews existing KV cache management strategies and benchmarks, rather than presenting new experimental results that would require specific dataset splits for reproduction.
Hardware Specification | No | The paper is a survey of existing research and does not describe experiments performed by its authors; therefore, it does not specify the hardware used for its own experimental runs. Section 6.3 discusses "Hardware-aware Design" in general terms for LLM inference, not for the survey's own experimental setup.
Software Dependencies | No | The paper is a survey and does not present new experimental results requiring specific software dependencies for reproduction. It discusses various software frameworks and libraries in the context of the reviewed works, but not for its own methodology.
Experiment Setup | No | As a survey paper, this work analyzes and categorizes existing research on KV cache management. It does not present new experimental results or detailed experimental setups, such as hyperparameters or system-level training settings, from its own research.
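To make the token-level category above concrete, the following is a minimal, hypothetical sketch of one well-known token-level strategy the survey covers: evicting KV cache entries while always retaining a few leading "attention sink" tokens plus a sliding window of recent tokens. All class and parameter names (`SlidingWindowKVCache`, `n_sink`, `window`) are illustrative assumptions, not from the survey itself.

```python
from collections import deque

class SlidingWindowKVCache:
    """Token-level KV cache eviction sketch (hypothetical names).

    Keeps the first `n_sink` entries plus the most recent `window`
    entries, in the spirit of attention-sink / sliding-window
    strategies categorized as token-level optimizations.
    """

    def __init__(self, n_sink: int = 4, window: int = 8):
        self.n_sink = n_sink
        self.sink: list = []                        # always-kept leading tokens
        self.recent: deque = deque(maxlen=window)   # rolling window of recent tokens

    def append(self, key, value):
        entry = (key, value)
        if len(self.sink) < self.n_sink:
            self.sink.append(entry)
        else:
            self.recent.append(entry)  # deque evicts the oldest entry automatically

    def contents(self):
        return self.sink + list(self.recent)

# Usage: cache 10 tokens but keep only 2 sink + 3 recent entries.
cache = SlidingWindowKVCache(n_sink=2, window=3)
for t in range(10):
    cache.append(f"k{t}", f"v{t}")
kept = [k for k, _ in cache.contents()]
print(kept)  # ['k0', 'k1', 'k7', 'k8', 'k9']
```

The design choice here is that eviction is purely positional; many token-level methods the survey reviews instead score entries by attention weight before evicting, but the cache-size bound works the same way.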