Lexico: Extreme KV Cache Compression via Sparse Coding over Universal Dictionaries

Authors: Junhyuck Kim, Jongho Park, Jaewoong Cho, Dimitris Papailiopoulos

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate our method on various models (Llama-3-8B, Llama-3.1-8B-Instruct, Llama-3.2-1B-Instruct, Llama-3.2-3B-Instruct, Mistral-7B-Instruct, Qwen2.5-14B-Instruct), using dictionaries trained on WikiText-103, as done in Section 2.3. To assess the effectiveness of Lexico in memory reduction while maintaining long-context understanding, we conduct experiments on selected tasks from LongBench (Bai et al., 2023), following the setup of Liu et al. (2024b). See Table 7 in Appendix B for task details. Additionally, we evaluate generative performance on complex reasoning tasks, such as GSM8K (Cobbe et al., 2021) with 5-shot prompting and MMLU-Pro Engineering/Law (Wang et al., 2024a) with zero-shot chain-of-thought.
Researcher Affiliation | Collaboration | 1KRAFTON, 2University of Wisconsin-Madison, 3Microsoft Research. Correspondence to: Dimitris Papailiopoulos <EMAIL>.
Pseudocode | Yes | Algorithm 1 illustrates a naive implementation of OMP for understanding. In Lexico, we adopt the OMP v0 implementation proposed by Zhu et al. (2020), which minimizes computational complexity using efficient inverse Cholesky factorization. Additionally, we integrate methods from Lubonja et al. (2024) for batched GPU execution and extend the implementation to handle multiple dictionaries in parallel. Algorithm 1 OMP ... Algorithm 2 Prefilling and decoding with Lexico
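For intuition, the naive OMP variant referenced above can be sketched in a few lines of NumPy. This is an illustrative toy version (greedy atom selection followed by a least-squares refit on the selected support each step), not the Cholesky-based batched GPU implementation Lexico actually uses; the function name and argument layout are assumptions for this sketch.

```python
import numpy as np

def omp(dictionary, y, s):
    """Naive Orthogonal Matching Pursuit.

    Approximates vector y as a combination of at most s atoms,
    where each row of `dictionary` is a unit-norm atom.
    Returns the selected atom indices and their coefficients.
    """
    residual = y.astype(float)
    support = []
    coeffs = np.zeros(0)
    for _ in range(s):
        # Greedy step: pick the atom most correlated with the residual.
        k = int(np.argmax(np.abs(dictionary @ residual)))
        if k in support:  # no new atom helps; residual is (near) zero
            break
        support.append(k)
        # Refit all coefficients on the current support by least squares.
        sub = dictionary[support].T            # shape (d, |support|)
        coeffs, *_ = np.linalg.lstsq(sub, y, rcond=None)
        residual = y - sub @ coeffs
    return support, coeffs
```

With an orthonormal dictionary (e.g., the identity), a 2-sparse signal is recovered exactly in two iterations; the Cholesky-based variants cited above compute the same least-squares refits incrementally instead of from scratch.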
Open Source Code | Yes | Our code is available at https://github.com/krafton-ai/lexico.
Open Datasets | Yes | For our experiments, we train a dictionary on WikiText-103 (Merity, 2016) for each model. This dictionary is only trained once and used universally across all tasks. ... We conduct experiments on selected tasks from LongBench (Bai et al., 2023)... Additionally, we evaluate generative performance on complex reasoning tasks, such as GSM8K (Cobbe et al., 2021) with 5-shot prompting and MMLU-Pro Engineering/Law (Wang et al., 2024a) with zero-shot chain-of-thought.
Dataset Splits | No | The paper mentions training dictionaries on WikiText-103 and evaluating on tasks such as GSM8K with 5-shot prompting and MMLU-Pro with zero-shot chain-of-thought, but it does not specify the train/test/validation splits (e.g., percentages, sample counts, or citations to predefined splits) needed to reproduce the data partitioning for these experiments or for dictionary training.
Hardware Specification | Yes | Table 1 summarizes training time for Llama-3.1-8B-Instruct on a single NVIDIA A100 at different sparsity s and dictionary size N.
Software Dependencies | No | The paper mentions using 'Adam (Kingma & Ba, 2014)' as an optimizer and refers to academic papers for the OMP implementation details (Zhu et al., 2020; Lubonja et al., 2024), but it does not specify any software libraries or frameworks with exact version numbers (e.g., Python 3.x, PyTorch 1.x) used to implement the methodology.
Experiment Setup | Yes | The dictionaries are trained on KV pairs generated from the WikiText-103 dataset using Adam (Kingma & Ba, 2014) with a learning rate of 0.0001 and a cosine decay schedule over 20 epochs. ... For both experiments, Lexico uses a dictionary size of N = 4096, a buffer size of nb = 128, and an approximation window size na = 1, compressing the oldest token with each new token generated. For KIVI-4 and KIVI-2, we use a quantization group size of g = 32 and a buffer size of nb = 128 ... For GSM8K and MMLU-Pro, we test for stronger memory savings, so we use g = 64 and nb = 64 for KIVI.
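As a rough illustration of what the quantization group size g controls in the KIVI baselines, here is a minimal NumPy sketch of per-group asymmetric uniform quantization: each contiguous group of g values shares one scale and zero-point, so smaller g means finer adaptation at the cost of more metadata. The function names and tensor layout are illustrative assumptions, not KIVI's actual implementation.

```python
import numpy as np

def group_quantize(x, bits=2, g=32):
    """Quantize a flat array to `bits` bits with per-group scale/offset.

    Each group of g consecutive values gets its own min (offset) and
    scale, mapping the group's range onto {0, ..., 2**bits - 1}.
    """
    levels = 2 ** bits - 1
    groups = x.reshape(-1, g)
    lo = groups.min(axis=1, keepdims=True)
    hi = groups.max(axis=1, keepdims=True)
    # Guard constant groups against a zero scale.
    scale = np.where(hi > lo, (hi - lo) / levels, 1.0)
    q = np.clip(np.round((groups - lo) / scale), 0, levels)
    return q.astype(np.uint8), scale, lo

def group_dequantize(q, scale, lo):
    """Reconstruct the flat array from codes plus per-group metadata."""
    return (q * scale + lo).reshape(-1)
```

Round-to-nearest guarantees the per-element reconstruction error is at most half a quantization step (scale / 2) within each group, which is why a larger g (as in the g = 64 GSM8K/MMLU-Pro setting) saves metadata memory but coarsens the error bound when a group spans a wider value range.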