TreeKV: Smooth Key-Value Cache Compression with Tree Structures
Authors: Ziwei He, Jian Yuan, Haoli Bai, Jingwen Leng, Bo Jiang
IJCAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | TreeKV consistently surpasses all baseline models in language modeling tasks on PG19 and OpenWebText2, allowing LLMs trained with a short context window to generalize to longer windows with a 16x cache reduction. On the LongBench benchmark, TreeKV achieves the best performance with only 6% of the budget at optimal efficiency. Our ablation study further proved the tree structure's significant role in shaping the model's decision making. We provide extensive experimental results that validate the effectiveness of TreeKV in both prefilling and generation stages. |
| Researcher Affiliation | Collaboration | Ziwei He¹, Jian Yuan¹, Haoli Bai², Jingwen Leng¹, Bo Jiang¹ (¹Shanghai Jiao Tong University, ²Huawei Noah's Ark Lab) |
| Pseudocode | Yes | Algorithm 1: Compression by TreeKV |
| Open Source Code | No | The paper states that baselines are run using their officially released code, but provides no statement or link regarding the open-sourcing of TreeKV's own code. |
| Open Datasets | Yes | Our evaluation of TreeKV demonstrates its superiority over existing methods in both prefilling and decoding phases. We first assess its performance on the language modeling task with PG19 [Rae et al., 2019] and OpenWebText2 [Gao et al., 2020] for long text decoding. We conduct experiments on long context understanding tasks using the LongBench [Bai et al., 2023] benchmark. |
| Dataset Splits | No | The paper mentions using the 'PG19 test set' and OpenWebText2, from which 'we randomly selected 100 samples from the test set', and evaluates on the LongBench benchmark. While these imply predefined test sets, the paper does not explicitly provide training/validation/test splits, percentages, or absolute counts required to fully reproduce the data partitioning. |
| Hardware Specification | Yes | All the experiments utilize bf16 precision on Nvidia RTX4090 GPUs. |
| Software Dependencies | No | The paper does not explicitly mention any specific software dependencies with version numbers. |
| Experiment Setup | Yes | The cache size of all the efficient methods is set to 1024. We evaluate perplexity using a sliding window approach with a stride of 2048 for PG19 and 1024 for OpenWebText2, respectively. For language understanding tasks on LongBench, we truncate the inputs to 32k in the same manner as SnapKV [Li et al., 2024]. We use Llama-2-7B [Touvron et al., 2023], pre-trained with a 4K context length, as the base model, considering its popularity and outstanding performance. We employ Llama-3.2-1B-Instruct as our base model. |
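The strided sliding-window perplexity evaluation quoted in the Experiment Setup row is a standard protocol: each window re-scores only the tokens not covered by the previous window, so every token contributes to the loss exactly once while still conditioning on up to a full window of context. A minimal sketch of the windowing logic is below; `logprob_fn` is a hypothetical placeholder standing in for a call into the actual language model (e.g. Llama-2-7B), which the paper does not specify at this level of detail.

```python
import math
from typing import Callable, List, Tuple


def strided_windows(n_tokens: int, window: int, stride: int) -> List[Tuple[int, int, int]]:
    """Return (start, end, n_scored) spans for sliding-window evaluation.

    Each window covers token positions [start, end); only the last
    `n_scored` tokens (those not already scored by the previous window)
    contribute to the loss, so every token is counted exactly once.
    """
    spans = []
    prev_end = 0
    for start in range(0, n_tokens, stride):
        end = min(start + window, n_tokens)
        spans.append((start, end, end - prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return spans


def sliding_window_perplexity(
    logprob_fn: Callable[[int, int], float],
    n_tokens: int,
    window: int,
    stride: int,
) -> float:
    """Aggregate per-token negative log-likelihood over strided windows.

    `logprob_fn(i, ctx_start)` is assumed to return
    log p(token_i | tokens[ctx_start:i]) under the model being evaluated.
    """
    total_nll, total_scored = 0.0, 0
    for start, end, n_scored in strided_windows(n_tokens, window, stride):
        for i in range(end - n_scored, end):
            total_nll -= logprob_fn(i, start)
            total_scored += 1
    return math.exp(total_nll / total_scored)
```

In the paper's setting, `window` would be the model's context length (e.g. 4096 for Llama-2-7B) and `stride` 2048 for PG19 or 1024 for OpenWebText2; a larger stride evaluates faster but gives later tokens in each window less preceding context.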