MatryoshkaKV: Adaptive KV Compression via Trainable Orthogonal Projection
Authors: Bokai Lin, Zihao Zeng, Zipeng Xiao, Siqi Kou, TianQi Hou, Xiaofeng Gao, Hao Zhang, Zhijie Deng
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically witness the high data efficiency of our training procedure and find that our method can sustain over 90% performance with an average KV cache compression rate of 60% (and up to 75% in certain extreme scenarios) for popular LLMs like LLaMA2-7B-base and Mistral-7B-v0.3-base. (...) 5 EXPERIMENTS In this section, we conduct experiments on continual pre-training (CPT) and supervised fine-tuning (SFT) scenarios to demonstrate that our MatryoshkaKV can not only preserve the foundation knowledge of a base model but also be compatible with LoRA (Hu et al., 2021) for downstream tasks. |
| Researcher Affiliation | Collaboration | 1Shanghai Jiao Tong University 2Huawei 3University of California, San Diego |
| Pseudocode | Yes | Algorithm 1: Greedy search for adaptive compression levels in our efficient LLM. |
| Open Source Code | Yes | The code is available at https://github.com/The-kamisato/MatryoshkaKV-cache.git. |
| Open Datasets | Yes | We conduct continual pre-training (Ke et al., 2023) using the RedPajama dataset (Computer, 2023). (...) PIQA (Bisk et al., 2019), ARC-challenge (ARC-C) (Clark et al., 2018), ARC-easy (ARC-E) (Clark et al., 2018), WinoGrande (WG) (Sakaguchi et al., 2019), HellaSwag (HLSG) (Zellers et al., 2019), and CommonsenseQA (CSQA) (Talmor et al., 2019). (...) OBQA (Mihaylov et al., 2018), GSM8K (Cobbe et al., 2021). |
| Dataset Splits | No | The paper mentions using subsets of the RedPajama dataset and various benchmarks like PIQA, GSM8K, and ARC-challenge. It evaluates on 'zero-shot benchmarks' and performs 'supervised fine-tuning', implying training and test sets. However, it does not explicitly state specific percentages or absolute counts for splits, nor does it cite predefined split methodologies for any of the datasets used. For SFT, it mentions 'standard SFT practices' but gives no specific split details. |
| Hardware Specification | No | We train with a total of 30 GPU hours, processing just under 200 million tokens (20% of the RedPajama sample 1T, i.e. 0.02% of the full RedPajama dataset). The text mentions 'GPU hours' but does not specify any particular GPU models (e.g., NVIDIA A100, Tesla V100), CPU models, or memory specifications used for the experiments. |
| Software Dependencies | No | The paper does not explicitly list any specific software dependencies with version numbers, such as Python, PyTorch, or CUDA versions, that would be required to replicate the experiment. |
| Experiment Setup | Yes | We adopt the Matryoshka training strategy detailed in Section 4.2 and fine-tune MatryoshkaKV projections with knowledge distillation loss in Equation 1 and language modeling loss, applying a 1:3 weighting ratio between the two losses. The projection ranks rk and rv are randomly sampled from a predefined schedule set {(i/8)·d}_{i=1}^{8} during training and are chosen dynamically with the greedy search for adaptive compression levels, as detailed in Section 4.3, during inference. During the greedy search for adaptive compression levels, we define the compression rate interval as d/8, where the head dimension d for each attention head in LLaMA2-7B-base is 128. (...) We design a two-stage training strategy to make the Matryoshka training strategy compatible with LoRA (Hu et al., 2021) fine-tuning. |
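
The setup row above describes two training-time choices: projection ranks rk and rv drawn uniformly from the schedule {(i/8)·d} for i = 1..8 (d = 128 for LLaMA2-7B), and a 1:3 weighting between the distillation and language-modeling losses. A minimal sketch of both pieces, assuming per-head independent sampling and a simple normalized weighted sum (the function names and the normalization are illustrative assumptions, not the paper's code):

```python
import random

def sample_ranks(num_heads, d=128, seed=None):
    # Schedule set {(i/8)*d for i = 1..8} -> [16, 32, ..., 128] when d = 128,
    # as stated in the experiment setup; one (rk, rv) pair per attention head.
    rng = random.Random(seed)
    schedule = [i * d // 8 for i in range(1, 9)]
    r_k = [rng.choice(schedule) for _ in range(num_heads)]
    r_v = [rng.choice(schedule) for _ in range(num_heads)]
    return r_k, r_v

def total_loss(kd_loss, lm_loss, kd_weight=1.0, lm_weight=3.0):
    # 1:3 weighting between knowledge-distillation and language-modeling
    # losses; dividing by the weight sum is an assumption for readability.
    return (kd_weight * kd_loss + lm_weight * lm_loss) / (kd_weight + lm_weight)
```

With d = 128 every sampled rank is a multiple of d/8 = 16, which is also the compression-rate interval used by the greedy search at inference time.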
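
The pseudocode row cites "Algorithm 1: Greedy search for adaptive compression levels". The paper's exact procedure is in that algorithm; a hedged sketch of one plausible reading, where each head starts at full rank and the search repeatedly lowers the rank of the head whose reduction hurts a calibration loss the least until a rank budget is met (the helper names and the budget formulation are assumptions):

```python
def greedy_compression_levels(heads, schedule, budget, eval_loss):
    # heads: head identifiers; schedule: allowed ranks in increasing order,
    # e.g. multiples of d/8; budget: target total rank across all heads;
    # eval_loss: callable mapping {head: rank} -> calibration loss.
    levels = {h: schedule[-1] for h in heads}  # start every head at full rank
    while sum(levels.values()) > budget:
        best_head, best_loss = None, float("inf")
        for h in heads:
            idx = schedule.index(levels[h])
            if idx == 0:
                continue  # this head is already at the lowest rank
            trial = dict(levels)
            trial[h] = schedule[idx - 1]  # try one step of compression
            loss = eval_loss(trial)
            if loss < best_loss:
                best_head, best_loss = h, loss
        if best_head is None:
            break  # no head can be compressed further
        levels[best_head] = schedule[schedule.index(levels[best_head]) - 1]
    return levels
```

Under this reading, sensitive heads keep high ranks while robust heads absorb most of the compression, which matches the adaptive per-head compression levels the setup row describes.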