TreeKV: Smooth Key-Value Cache Compression with Tree Structures

Authors: Ziwei He, Jian Yuan, Haoli Bai, Jingwen Leng, Bo Jiang

IJCAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | TreeKV consistently surpasses all baseline models on language modeling tasks with PG19 and OpenWebText2, allowing LLMs trained with a short context window to generalize to longer windows with a 16x cache reduction. On the LongBench benchmark, TreeKV achieves the best performance with only 6% of the budget at optimal efficiency. Our ablation study further confirms the tree structure's significant role in shaping the model's decision making. We provide extensive experimental results that validate the effectiveness of TreeKV in both prefilling and generation stages.
Researcher Affiliation | Collaboration | Ziwei He (1), Jian Yuan (1), Haoli Bai (2), Jingwen Leng (1), Bo Jiang (1); (1) Shanghai Jiao Tong University, (2) Huawei Noah's Ark Lab. EMAIL
Pseudocode | Yes | Algorithm 1: Compression by TreeKV
Open Source Code | No | The paper mentions running baselines with their officially released code, but provides no statement or link regarding the open-sourcing of TreeKV's code.
Open Datasets | Yes | Our evaluation of TreeKV demonstrates its superiority over existing methods in both prefilling and decoding phases. We first assess its performance on the language modeling task with PG19 [Rae et al., 2019] and OpenWebText2 [Gao et al., 2020] for long-text decoding. We conduct experiments on long-context understanding tasks using the LongBench [Bai et al., 2023] benchmark.
Dataset Splits | No | The paper mentions using the PG19 test set and OpenWebText2, from which "we randomly selected 100 samples from the test set", and evaluates on the LongBench benchmark. While these imply predefined test sets, the paper does not explicitly provide the training/validation/test splits, percentages, or absolute counts required to fully reproduce the data partitioning.
Hardware Specification | Yes | All the experiments utilize bf16 precision on NVIDIA RTX 4090 GPUs.
Software Dependencies | No | The paper does not explicitly mention any specific software dependencies with version numbers.
Experiment Setup | Yes | The cache size of all the efficient methods is set to 1024. We evaluate perplexity using a sliding-window approach with a stride of 2048 for PG19 and 1024 for OpenWebText2, respectively. For language understanding tasks on LongBench, we truncate the inputs to 32k in the same manner as SnapKV [Li et al., 2024]. We use Llama-2-7B [Touvron et al., 2023], pre-trained with a 4K context length, as the base model, considering its popularity and outstanding performance. We also employ Llama-3.2-1B-Instruct as a base model.
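For reproduction purposes, the sliding-window perplexity evaluation quoted above can be sketched as follows. This is a minimal illustration, not code from the paper: `token_logprob` is a hypothetical stand-in for a model's conditional log-probability, and the window size of 4096 is an assumption; only the stride values (2048 for PG19, 1024 for OpenWebText2) come from the quoted setup.

```python
import math

def sliding_window_ppl(token_logprob, tokens, window=4096, stride=2048):
    """Perplexity with overlapping windows.

    Each window of up to `window` tokens is scored, but tokens already
    covered by the previous window only serve as conditioning context;
    every token is counted exactly once in the final average.
    `token_logprob(context, token)` is a hypothetical scorer returning
    log p(token | context).
    """
    nll, scored, prev_end = 0.0, 0, 0
    for begin in range(0, len(tokens), stride):
        end = min(begin + window, len(tokens))
        # Score only tokens not already counted; skip token 0 (no context).
        for i in range(max(prev_end, 1), end):
            context = tokens[begin:i]
            nll -= token_logprob(context, tokens[i])
            scored += 1
        prev_end = end
        if end == len(tokens):
            break
    return math.exp(nll / scored)
```

Under a uniform scorer that assigns every token probability 1/V, this returns a perplexity of V, which is a quick sanity check for the windowing logic.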