MatryoshkaKV: Adaptive KV Compression via Trainable Orthogonal Projection
Authors: Bokai Lin, Zihao Zeng, Zipeng Xiao, Siqi Kou, TianQi Hou, Xiaofeng Gao, Hao Zhang, Zhijie Deng
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically witness the high data efficiency of our training procedure and find that our method can sustain over 90% performance with an average KV cache compression rate of 60% (and up to 75% in certain extreme scenarios) for popular LLMs like LLaMA2-7B-base and Mistral-7B-v0.3-base. (...) 5 EXPERIMENTS In this section, we conduct experiments on continual pre-training (CPT) and supervised fine-tuning (SFT) scenarios to demonstrate that our MatryoshkaKV can not only preserve the foundation knowledge of a base model but also be compatible with LoRA (Hu et al., 2021) for downstream tasks. |
| Researcher Affiliation | Collaboration | 1Shanghai Jiao Tong University 2Huawei 3University of California, San Diego |
| Pseudocode | Yes | Algorithm 1: Greedy search for adaptive compression levels in our efficient LLM. |
| Open Source Code | Yes | The code is available at https://github.com/The-kamisato/MatryoshkaKV-cache.git. |
| Open Datasets | Yes | We conduct continual pre-training (Ke et al., 2023) using the RedPajama dataset (Computer, 2023). (...) PIQA (Bisk et al., 2019), ARC-challenge (ARC-C) (Clark et al., 2018), ARC-easy (ARC-E) (Clark et al., 2018), WinoGrande (WG) (Sakaguchi et al., 2019), HellaSwag (HLSG) (Zellers et al., 2019), and CommonsenseQA (CSQA) (Talmor et al., 2019). (...) OBQA (Mihaylov et al., 2018), GSM8K (Cobbe et al., 2021). |
| Dataset Splits | No | The paper mentions using subsets of the RedPajama dataset and various benchmarks like PIQA, GSM8K, and ARC-challenge. It evaluates on 'zero-shot benchmarks' and performs 'supervised fine-tuning', implying training and test sets. However, it does not explicitly state specific percentages or absolute counts for splits, nor does it cite predefined split methodologies for any of the datasets used. For SFT, it mentions 'standard SFT practices' but gives no specific split details. |
| Hardware Specification | No | We train with a total of 30 GPU hours, processing just under 200 million tokens (20% of the RedPajama sample 1T, i.e. 0.02% of the full RedPajama dataset). The text mentions 'GPU hours' but does not specify any particular GPU models (e.g., NVIDIA A100, Tesla V100), CPU models, or memory specifications used for the experiments. |
| Software Dependencies | No | The paper does not explicitly list any specific software dependencies with version numbers, such as Python, PyTorch, or CUDA versions, that would be required to replicate the experiment. |
| Experiment Setup | Yes | We adopt the Matryoshka training strategy detailed in Section 4.2 and fine-tune MatryoshkaKV projections with knowledge distillation loss in Equation 1 and language modeling loss, applying a 1:3 weighting ratio between the two losses. The projection ranks rk and rv are randomly sampled from a predefined schedule set {(i/8)·d}_{i=1}^{8} during training and are chosen dynamically with the greedy search for adaptive compression levels, as detailed in Section 4.3, during inference. During the greedy search for adaptive compression levels, we define the compression rate interval as d/8, where the head dimension d for each attention head in LLaMA2-7B-base is 128. (...) We design a two-stage training strategy to make the Matryoshka training strategy compatible with LoRA (Hu et al., 2021) fine-tuning. |
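
The setup row above describes two training-time choices: projection ranks rk and rv drawn uniformly from the schedule {(i/8)·d} for i = 1..8 (d = 128 for LLaMA2-7B), and a 1:3 weighting between the distillation and language-modeling losses. A minimal sketch of both pieces, assuming per-head independent sampling and a simple normalized weighted sum (the function names and the normalization are illustrative assumptions, not the paper's code):

```python
import random

def sample_ranks(num_heads, d=128, seed=None):
    # Schedule set {(i/8)*d for i = 1..8} -> [16, 32, ..., 128] when d = 128,
    # as stated in the experiment setup; one (rk, rv) pair per attention head.
    rng = random.Random(seed)
    schedule = [i * d // 8 for i in range(1, 9)]
    r_k = [rng.choice(schedule) for _ in range(num_heads)]
    r_v = [rng.choice(schedule) for _ in range(num_heads)]
    return r_k, r_v

def total_loss(kd_loss, lm_loss, kd_weight=1.0, lm_weight=3.0):
    # 1:3 weighting between knowledge-distillation and language-modeling
    # losses; dividing by the weight sum is an assumption for readability.
    return (kd_weight * kd_loss + lm_weight * lm_loss) / (kd_weight + lm_weight)
```

With d = 128 every sampled rank is a multiple of d/8 = 16, which is also the compression-rate interval used by the greedy search at inference time.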
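
The pseudocode row cites "Algorithm 1: Greedy search for adaptive compression levels". The paper's exact procedure is in that algorithm; a hedged sketch of one plausible reading, where each head starts at full rank and the search repeatedly lowers the rank of the head whose reduction hurts a calibration loss the least until a rank budget is met (the helper names and the budget formulation are assumptions):

```python
def greedy_compression_levels(heads, schedule, budget, eval_loss):
    # heads: head identifiers; schedule: allowed ranks in increasing order,
    # e.g. multiples of d/8; budget: target total rank across all heads;
    # eval_loss: callable mapping {head: rank} -> calibration loss.
    levels = {h: schedule[-1] for h in heads}  # start every head at full rank
    while sum(levels.values()) > budget:
        best_head, best_loss = None, float("inf")
        for h in heads:
            idx = schedule.index(levels[h])
            if idx == 0:
                continue  # this head is already at the lowest rank
            trial = dict(levels)
            trial[h] = schedule[idx - 1]  # try one step of compression
            loss = eval_loss(trial)
            if loss < best_loss:
                best_head, best_loss = h, loss
        if best_head is None:
            break  # no head can be compressed further
        levels[best_head] = schedule[schedule.index(levels[best_head]) - 1]
    return levels
```

Under this reading, sensitive heads keep high ranks while robust heads absorb most of the compression, which matches the adaptive per-head compression levels the setup row describes.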