ThinK: Thinner Key Cache by Query-Driven Pruning
Authors: Yuhui Xu, Zhanming Jie, Hanze Dong, Lei Wang, Xudong Lu, Aojun Zhou, Amrita Saha, Caiming Xiong, Doyen Sahoo
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive evaluations on the LLaMA and Mistral models across various long-sequence datasets verified the efficiency of ThinK. |
| Researcher Affiliation | Collaboration | ¹Salesforce AI Research, ²The Chinese University of Hong Kong |
| Pseudocode | No | The paper describes methods using mathematical formulations and prose. It does not contain an explicitly labeled 'Pseudocode' or 'Algorithm' section, nor does it present structured, step-by-step procedures in a code-like format. |
| Open Source Code | Yes | Our code has been made available at https://github.com/SalesforceAIResearch/ThinK. |
| Open Datasets | Yes | We evaluate our proposed method against state-of-the-art KV cache compression methods on two widely recognized benchmarks: LongBench and Needle-in-a-Haystack. LongBench (Bai et al., 2023) is designed to comprehensively assess the long-context understanding capabilities of LLMs... Needle-in-a-Haystack (Kamradt, 2023) is a recently developed benchmark... |
| Dataset Splits | No | The paper evaluates models on established benchmarks like LongBench and Needle-in-a-Haystack, but does not provide specific training/test/validation dataset splits, percentages, or sample counts used for these benchmarks within the text. |
| Hardware Specification | Yes | All the experiments are conducted using NVIDIA A100 GPUs. |
| Software Dependencies | No | The paper mentions using specific LLM models (LLaMA-2/3, Mistral) accessible via Hugging Face, but does not provide specific version numbers for any software dependencies like Hugging Face, PyTorch, Python, or CUDA. |
| Experiment Setup | Yes | For instance, when comparing SnapKV and SnapKV integrated with ThinK, we used a maximum pooling kernel size of 7 and an observation window size of 32, maintaining the same KV size for both configurations. ... We generate synthetic workloads with an input prompt length of 160 and an output length of 338. We set a batch size of 300 for both KIVI and our method. |
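The setup row above refers to ThinK's query-driven pruning of key-cache channels using queries from an observation window. A minimal sketch of that idea, assuming a Frobenius-norm scoring rule over each head dimension (the function name, shapes, and keep ratio here are illustrative, not the paper's exact implementation):

```python
import numpy as np

def prune_key_channels(Q, K, keep_ratio=0.6):
    """Query-driven key-channel pruning (illustrative sketch).

    Scores each channel d of the head dimension by the magnitude of its
    contribution to the attention logits, ||Q[:, d] @ K[:, d].T||_F,
    then keeps only the top-scoring channels of the cached keys K.
    """
    D = Q.shape[-1]
    # Per-channel contribution to Q @ K.T is the outer product of the
    # corresponding query and key columns; rank channels by its norm.
    scores = np.array(
        [np.linalg.norm(np.outer(Q[:, d], K[:, d])) for d in range(D)]
    )
    keep = np.sort(np.argsort(scores)[::-1][: int(D * keep_ratio)])
    return K[:, keep], keep

# Toy usage: 32 observation-window queries, 500 cached keys, head dim 128.
rng = np.random.default_rng(0)
Q = rng.standard_normal((32, 128))
K = rng.standard_normal((500, 128))
K_pruned, kept = prune_key_channels(Q, K)
```

With a 0.6 keep ratio, the key cache shrinks along the channel axis (here from 128 to 76 channels) while the sequence length is untouched; in practice the kept-channel indices must also be applied to the queries at attention time.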