ThinK: Thinner Key Cache by Query-Driven Pruning

Authors: Yuhui Xu, Zhanming Jie, Hanze Dong, Lei Wang, Xudong Lu, Aojun Zhou, Amrita Saha, Caiming Xiong, Doyen Sahoo

ICLR 2025

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive evaluations on the LLaMA and Mistral models across various long-sequence datasets verified the efficiency of ThinK. |
| Researcher Affiliation | Collaboration | 1. Salesforce AI Research; 2. The Chinese University of Hong Kong |
| Pseudocode | No | The paper describes methods using mathematical formulations and prose. It does not contain an explicitly labeled 'Pseudocode' or 'Algorithm' section, nor does it present structured, step-by-step procedures in a code-like format. |
| Open Source Code | Yes | Our code has been made available at https://github.com/SalesforceAIResearch/ThinK. |
| Open Datasets | Yes | We evaluate our proposed method against state-of-the-art KV cache compression methods on two widely recognized benchmarks: LongBench and Needle-in-a-Haystack. LongBench (Bai et al., 2023) is designed to comprehensively assess the long-context understanding capabilities of LLMs... Needle-in-a-Haystack (Kamradt, 2023) is a recently developed benchmark... |
| Dataset Splits | No | The paper evaluates models on established benchmarks like LongBench and Needle-in-a-Haystack, but does not provide specific training/test/validation dataset splits, percentages, or sample counts used for these benchmarks within the text. |
| Hardware Specification | Yes | All the experiments are conducted using NVIDIA A100 GPUs. |
| Software Dependencies | No | The paper mentions using specific LLM models (LLaMA-2/3, Mistral) accessible via Hugging Face, but does not provide specific version numbers for any software dependencies such as Hugging Face Transformers, PyTorch, Python, or CUDA. |
| Experiment Setup | Yes | For instance, when comparing SnapKV and SnapKV integrated with ThinK, we used a maximum pooling kernel size of 7 and an observation window size of 32, maintaining the same KV-size for both configurations. ... We generate synthetic workloads with an input prompt length of 160 and an output length of 338. We set a batch size of 300 for both KIVI and our method. |
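For quick reference when attempting a reproduction, the hyperparameters quoted in the Experiment Setup row can be collected into a plain configuration sketch. The key names below are illustrative choices of our own, not identifiers from the ThinK codebase; only the numeric values come from the paper.

```python
# Hyperparameters reported in the paper's experiment setup.
# Key names are illustrative, not taken from the ThinK repository.

# SnapKV vs. SnapKV + ThinK comparison (same KV-size for both).
snapkv_think_config = {
    "max_pooling_kernel_size": 7,
    "observation_window_size": 32,
}

# Synthetic throughput workload used for KIVI and ThinK.
throughput_workload = {
    "input_prompt_length": 160,
    "output_length": 338,
    "batch_size": 300,
}

print(snapkv_think_config)
print(throughput_workload)
```

Recording these values in one place makes it easy to verify that a reproduction run matches the paper's reported configuration.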