ThinK: Thinner Key Cache by Query-Driven Pruning

Authors: Yuhui Xu, Zhanming Jie, Hanze Dong, Lei Wang, Xudong Lu, Aojun Zhou, Amrita Saha, Caiming Xiong, Doyen Sahoo

ICLR 2025

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive evaluations on the LLaMA and Mistral models across various long-sequence datasets verified the efficiency of ThinK. |
| Researcher Affiliation | Collaboration | 1. Salesforce AI Research; 2. The Chinese University of Hong Kong |
| Pseudocode | No | The paper describes methods using mathematical formulations and prose. It does not contain an explicitly labeled 'Pseudocode' or 'Algorithm' section, nor does it present structured, step-by-step procedures in a code-like format. |
| Open Source Code | Yes | Our code has been made available at https://github.com/SalesforceAIResearch/ThinK. |
| Open Datasets | Yes | We evaluate our proposed method against state-of-the-art KV cache compression methods on two widely recognized benchmarks: LongBench and Needle-in-a-Haystack. LongBench (Bai et al., 2023) is designed to comprehensively assess the long-context understanding capabilities of LLMs... Needle-in-a-Haystack (Kamradt, 2023) is a recently developed benchmark... |
| Dataset Splits | No | The paper evaluates models on established benchmarks like LongBench and Needle-in-a-Haystack, but does not provide specific training/test/validation dataset splits, percentages, or sample counts used for these benchmarks within the text. |
| Hardware Specification | Yes | All the experiments are conducted using NVIDIA A100 GPUs. |
| Software Dependencies | No | The paper mentions using specific LLM models (LLaMA-2/3, Mistral) accessible via Hugging Face, but does not provide specific version numbers for any software dependencies such as Hugging Face Transformers, PyTorch, Python, or CUDA. |
| Experiment Setup | Yes | For instance, when comparing SnapKV and SnapKV integrated with ThinK, we used a maximum pooling kernel size of 7 and an observation window size of 32, maintaining the same KV-size for both configurations. ... We generate synthetic workloads with an input prompt length of 160 and an output length of 338. We set a batch size of 300 for both KIVI and our method. |
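For quick reference when attempting a reproduction, the hyperparameters quoted in the Experiment Setup row can be collected into a plain configuration sketch. The key names below are illustrative choices of our own, not identifiers from the ThinK codebase; only the numeric values come from the paper.

```python
# Hyperparameters reported in the paper's experiment setup.
# Key names are illustrative, not taken from the ThinK repository.

# SnapKV vs. SnapKV + ThinK comparison (same KV-size for both).
snapkv_think_config = {
    "max_pooling_kernel_size": 7,
    "observation_window_size": 32,
}

# Synthetic throughput workload used for KIVI and ThinK.
throughput_workload = {
    "input_prompt_length": 160,
    "output_length": 338,
    "batch_size": 300,
}

print(snapkv_think_config)
print(throughput_workload)
```

Recording these values in one place makes it easy to verify that a reproduction run matches the paper's reported configuration.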