QuantSpec: Self-Speculative Decoding with Hierarchical Quantized KV Cache

Authors: Rishabh Tiwari, Haocheng Xi, Aditya Tomar, Coleman Richard Charles Hooper, Sehoon Kim, Maxwell Horton, Mahyar Najibi, Michael W. Mahoney, Kurt Keutzer, Amir Gholami

ICML 2025

Reproducibility Variable Result LLM Response
Research Type Experimental In this section, we evaluate the performance of QuantSpec across multiple datasets and context lengths. Our evaluation focuses on three key dimensions: (1) the acceptance ratio between the draft and target models, (2) GPU memory consumption, and (3) end-to-end serving speedup. We begin by presenting a detailed benchmarking of acceptance rate, memory usage, and end-to-end speedup across different datasets. Then, we highlight the performance gains achieved by our custom kernels for quantized KV cache. Finally, we present an extensive ablation study focusing on the contribution of weight versus KV cache quantization to the final speedup.
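The acceptance ratio named as the first evaluation dimension is the metric produced by the standard speculative-sampling accept/reject rule (Leviathan et al., 2023). The paper's exact verification procedure is not quoted in this excerpt, so the following is only a minimal illustrative sketch with toy two-token distributions; all names and numbers below are illustrative assumptions, not the paper's implementation.

```python
import random

def speculative_accept(draft_tokens, p_draft, p_target, rng):
    """Standard speculative-sampling acceptance rule: draft token x at step t
    is kept with probability min(1, p_target(x) / p_draft(x)); the first
    rejection ends the speculated run. Returns the number of accepted tokens,
    from which an acceptance ratio (accepted / proposed) can be computed."""
    accepted = 0
    for t, token in enumerate(draft_tokens):
        p_t = p_target[t][token]  # target-model probability of the draft token
        q_t = p_draft[t][token]   # draft-model probability of the same token
        if rng.random() < min(1.0, p_t / q_t):
            accepted += 1
        else:
            break  # reject and stop; remaining speculated tokens are discarded
    return accepted

# Toy example: 2-token vocabulary, speculation length gamma = 3.
rng = random.Random(0)
draft_tokens = [0, 1, 0]
p_draft = [{0: 0.6, 1: 0.4}] * 3    # draft (quantized) model distributions
p_target = [{0: 0.9, 1: 0.1}] * 3   # target (full-precision) model distributions
n = speculative_accept(draft_tokens, p_draft, p_target, rng)
print(f"accepted {n} of {len(draft_tokens)} draft tokens")
```

In self-speculative setups like the one described here, the draft and target models share weights, so the two distributions come from the same network run at different precisions.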
Researcher Affiliation Collaboration 1UC Berkeley, 2Apple, 3ICSI, 4LBNL. Correspondence to: Amir Gholami <EMAIL>.
Pseudocode Yes Algorithm 1 QuantSpec Algorithm
Open Source Code No The paper does not explicitly state that the code for the described methodology is open-source, nor does it provide a link to a code repository.
Open Datasets Yes We evaluate QuantSpec using long-context variants of LLaMA-2 and LWM models as target models. For benchmarking decoding speedup, we use PG-19 (Rae et al., 2019) (an open-vocabulary language modeling benchmark derived from books) and two long-context summarization datasets, namely ∞BENCH Sum (Zhang et al., 2024c; Yen et al., 2024) and Multi-LexSum (Shen et al., 2022; Yen et al., 2024). More details about the datasets are provided in Appendix F. Following Sadhukhan et al. (2024), we compare against two recent sparse-KV-based self-speculative decoding baselines: StreamingLLM (Sadhukhan et al., 2024; Xiao et al., 2023a) and SnapKV (Sadhukhan et al., 2024; Li et al., 2024a).
Dataset Splits No The paper mentions evaluating performance across different context lengths and for 10 different examples, but it does not specify explicit training, validation, or test dataset splits for the experiments conducted using PG-19, ∞BENCH Sum, or Multi-LexSum. It refers to 'optimal speculation length γ... determined through a hyperparameter search for each dataset-model pair' but does not detail dataset splits.
Hardware Specification Yes All experiments are performed on a node equipped with 8 NVIDIA RTX A6000 GPUs.
Software Dependencies No The paper mentions using 'custom CUDA kernels' and refers to 'FlashAttention (Dao et al., 2022)' and 'Flash-Decoding (Dao et al., 2023)', but it does not specify version numbers for CUDA, FlashAttention/Flash-Decoding libraries, or any other software dependencies like PyTorch, Python, etc., required for reproducibility.
Experiment Setup Yes We fix the quantization group size at 128, the residual length R for the KV cache at 256, and limit the number of output tokens to 90. The optimal speculation length γ for each dataset is determined through a hyperparameter search for each dataset-model pair. Details of the hyperparameter search are provided in Appendix G.
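The setup above fixes a quantization group size of 128 and a residual length R = 256 for the KV cache, i.e. the most recent R tokens stay unquantized. The paper's quantizer and kernels are not reproduced here; the sketch below is a minimal illustration assuming plain asymmetric uniform 4-bit quantization with one (scale, zero-point) pair per group, with all function names and toy tensor sizes chosen for illustration only.

```python
import numpy as np

GROUP = 128   # quantization group size (from the setup above)
R = 256       # residual window: most recent R tokens kept in full precision
BITS = 4      # illustrative bit width; not stated in this excerpt

def quantize_groups(x, bits=BITS):
    """Asymmetric uniform quantization with one (scale, zero) pair per
    contiguous group of GROUP values."""
    x = x.reshape(-1, GROUP)
    lo = x.min(axis=1, keepdims=True)
    hi = x.max(axis=1, keepdims=True)
    scale = (hi - lo) / (2**bits - 1)
    scale[scale == 0] = 1.0  # guard against constant groups
    q = np.round((x - lo) / scale).astype(np.uint8)
    return q, scale, lo

def dequantize_groups(q, scale, lo):
    return q.astype(np.float32) * scale + lo

# Toy KV slice: seq_len tokens x head_dim channels.
rng = np.random.default_rng(0)
seq_len, head_dim = 1024, 128
kv = rng.standard_normal((seq_len, head_dim)).astype(np.float32)

frozen, residual = kv[:-R], kv[-R:]   # residual window stays full precision
q, scale, lo = quantize_groups(frozen)
recon = dequantize_groups(q, scale, lo).reshape(frozen.shape)
err = np.abs(recon - frozen).max()
print(f"max abs dequant error: {err:.4f}")
```

The residual window keeps the freshest entries exact, so quantization error only affects older tokens, which is one common motivation for hybrid full-precision/quantized KV-cache layouts.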