TreeKV: Smooth Key-Value Cache Compression with Tree Structures

Authors: Ziwei He, Jian Yuan, Haoli Bai, Jingwen Leng, Bo Jiang

IJCAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | TreeKV consistently surpasses all baseline models on language modeling tasks with PG19 and OpenWebText2, allowing LLMs trained with a short context window to generalize to longer windows with a 16x cache reduction. On the LongBench benchmark, TreeKV achieves the best performance with only 6% of the budget at optimal efficiency. Our ablation study further confirms the tree structure's significant role in shaping the model's decision making. We provide extensive experimental results that validate the effectiveness of TreeKV in both prefilling and generation stages.
Researcher Affiliation | Collaboration | Ziwei He (1), Jian Yuan (1), Haoli Bai (2), Jingwen Leng (1), Bo Jiang (1); (1) Shanghai Jiao Tong University, (2) Huawei Noah's Ark Lab. EMAIL
Pseudocode | Yes | Algorithm 1: Compression by TreeKV
Open Source Code | No | The paper mentions running baselines with their officially released code, but provides no statement or link regarding the open-sourcing of TreeKV's code.
Open Datasets | Yes | Our evaluation of TreeKV demonstrates its superiority over existing methods in both prefilling and decoding phases. We first assess its performance on the language modeling task with PG19 [Rae et al., 2019] and OpenWebText2 [Gao et al., 2020] for long-text decoding. We conduct experiments on long-context understanding tasks using the LongBench [Bai et al., 2023] benchmark.
Dataset Splits | No | The paper mentions using the PG19 test set and OpenWebText2, from which "we randomly selected 100 samples from the test set", and evaluates on the LongBench benchmark. While these imply predefined test sets, the paper does not explicitly provide the training/validation/test splits, percentages, or absolute counts required to fully reproduce the data partitioning.
Hardware Specification | Yes | All the experiments utilize bf16 precision on NVIDIA RTX 4090 GPUs.
Software Dependencies | No | The paper does not explicitly mention any specific software dependencies with version numbers.
Experiment Setup | Yes | The cache size of all the efficient methods is set to 1024. We evaluate perplexity using a sliding-window approach with a stride of 2048 for PG19 and 1024 for OpenWebText2, respectively. For language understanding tasks on LongBench, we truncate the inputs to 32k in the same manner as SnapKV [Li et al., 2024]. We use Llama-2-7B [Touvron et al., 2023], pre-trained with a 4K context length, as the base model, considering its popularity and outstanding performance. We also employ Llama-3.2-1B-Instruct as a base model.
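For reproduction purposes, the sliding-window perplexity evaluation quoted above can be sketched as follows. This is a minimal illustration, not code from the paper: `token_logprob` is a hypothetical stand-in for a model's conditional log-probability, and the window size of 4096 is an assumption; only the stride values (2048 for PG19, 1024 for OpenWebText2) come from the quoted setup.

```python
import math

def sliding_window_ppl(token_logprob, tokens, window=4096, stride=2048):
    """Perplexity with overlapping windows.

    Each window of up to `window` tokens is scored, but tokens already
    covered by the previous window only serve as conditioning context;
    every token is counted exactly once in the final average.
    `token_logprob(context, token)` is a hypothetical scorer returning
    log p(token | context).
    """
    nll, scored, prev_end = 0.0, 0, 0
    for begin in range(0, len(tokens), stride):
        end = min(begin + window, len(tokens))
        # Score only tokens not already counted; skip token 0 (no context).
        for i in range(max(prev_end, 1), end):
            context = tokens[begin:i]
            nll -= token_logprob(context, tokens[i])
            scored += 1
        prev_end = end
        if end == len(tokens):
            break
    return math.exp(nll / scored)
```

Under a uniform scorer that assigns every token probability 1/V, this returns a perplexity of V, which is a quick sanity check for the windowing logic.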