CSR: Achieving 1 Bit Key-Value Cache via Sparse Representation

Authors: Hongxuan Zhang, Yao Zhao, Jiaqi Zheng, Chenyi Zhuang, Jinjie Gu, Guihai Chen

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our extensive experiments demonstrate that CSR matches the performance of state-of-the-art KV cache quantization algorithms while ensuring robust functionality in memory-constrained environments.
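The paper's headline claim of a 1-bit KV cache implies roughly a 16x memory saving over the usual FP16 cache. A back-of-envelope sketch of that arithmetic (the model dimensions below are assumptions chosen to resemble a Llama-2-7B-class model, not figures taken from the paper):

```python
def kv_cache_bytes(seq_len, bits, layers=32, heads=32, head_dim=128):
    """Approximate KV cache size in bytes for one sequence.

    Assumed dims (layers/heads/head_dim) are illustrative, not from the paper.
    The factor of 2 covers the two cached tensors per layer: keys and values.
    """
    return 2 * layers * heads * head_dim * seq_len * bits / 8

fp16_cache = kv_cache_bytes(seq_len=4096, bits=16)  # 2 GiB at FP16
one_bit_cache = kv_cache_bytes(seq_len=4096, bits=1)  # 128 MiB at 1 bit
```

Under these assumed dimensions, the 1-bit representation shrinks a 2 GiB FP16 cache to 128 MiB, a 16x reduction.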
Researcher Affiliation | Collaboration | Hongxuan Zhang1,2*, Yao Zhao2, Jiaqi Zheng1, Chenyi Zhuang2, Jinjie Gu2, Guihai Chen1 (1Nanjing University, 2Ant Group); EMAIL, EMAIL, EMAIL, EMAIL, EMAIL
Pseudocode | Yes | Algorithm 1: Neural Dict
Open Source Code | No | The text does not explicitly state that source code for the methodology is openly provided, nor does it provide a direct link to a code repository. Footnotes point to Hugging Face Transformers documentation (a third-party tool) and an arXiv preprint of their extended version, neither of which is a code repository for their specific implementation.
Open Datasets | Yes | We extracted a range of prompts from the wikitext dataset (Merity et al. 2016)... We utilized the LongBench benchmark (Bai et al. 2023), which is a bilingual, multitask benchmark designed to assess the long-context understanding capabilities of LLMs.
Dataset Splits | No | The paper mentions using a 'calibration corpus dataset' and a 'test dataset' for Neural Dict training and evaluation, and the 'LongBench benchmark' for model evaluation. However, it does not specify the exact split percentages, sample counts, or the methodology used to create these splits for any of the datasets.
Hardware Specification | Yes | A single NVIDIA A100 GPU (80 GB) on a machine with 128 GB of memory.
Software Dependencies | No | The paper states that LLMs are based on the 'Hugging Face Transformers library' but does not provide any specific version numbers for this library or any other software dependencies such as Python or PyTorch.
Experiment Setup | Yes | In the experiments, unless stated otherwise, the Value Cache uses sn = 2 and the Key Cache uses sn = 1. For simplicity, CSR-s denotes the MP-level. In the online part of CSR-s, the Guard size per layer is 8192 and the sampling size is 4096 for Llama2 and Baichuan2; for Llama3, the Guard size is 2048 with a sampling size of 1024.
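For reference, the reported hyperparameters can be collected into a single configuration sketch. The dictionary layout and key names below are illustrative assumptions; only the numeric values come from the quoted experiment setup:

```python
# Illustrative configuration sketch of the reported CSR experiment setup.
# Key names are assumptions for readability; values are from the paper's text.
CSR_SETUP = {
    "value_cache_sn": 2,  # sn used for the Value Cache
    "key_cache_sn": 1,    # sn used for the Key Cache
    "online": {           # per-model settings for the online part of CSR-s
        "llama2":    {"guard_size_per_layer": 8192, "sampling_size": 4096},
        "baichuan2": {"guard_size_per_layer": 8192, "sampling_size": 4096},
        "llama3":    {"guard_size_per_layer": 2048, "sampling_size": 1024},
    },
}
```

Such a layout makes the Llama2/Baichuan2 vs. Llama3 difference explicit: Llama3 uses a 4x smaller Guard size and sampling size than the other two models.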