QuantSpec: Self-Speculative Decoding with Hierarchical Quantized KV Cache

Authors: Rishabh Tiwari, Haocheng Xi, Aditya Tomar, Coleman Richard Charles Hooper, Sehoon Kim, Maxwell Horton, Mahyar Najibi, Michael W. Mahoney, Kurt Keutzer, Amir Gholami

ICML 2025

Reproducibility Variable Result LLM Response
Research Type Experimental In this section, we evaluate the performance of QuantSpec across multiple datasets and context lengths. Our evaluation focuses on three key dimensions: (1) the acceptance ratio between the draft and target models, (2) GPU memory consumption, and (3) end-to-end serving speedup. We begin by presenting a detailed benchmarking of acceptance rate, memory usage, and end-to-end speedup across different datasets. Then, we highlight the performance gains achieved by our custom kernels for quantized KV cache. Finally, we present an extensive ablation study focusing on the contribution of weight versus KV cache quantization to the final speedup.
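The acceptance ratio named as the first evaluation dimension is the metric produced by the standard speculative-sampling accept/reject rule (Leviathan et al., 2023). The paper's exact verification procedure is not quoted in this excerpt, so the following is only a minimal illustrative sketch with toy two-token distributions; all names and numbers below are illustrative assumptions, not the paper's implementation.

```python
import random

def speculative_accept(draft_tokens, p_draft, p_target, rng):
    """Standard speculative-sampling acceptance rule: draft token x at step t
    is kept with probability min(1, p_target(x) / p_draft(x)); the first
    rejection ends the speculated run. Returns the number of accepted tokens,
    from which an acceptance ratio (accepted / proposed) can be computed."""
    accepted = 0
    for t, token in enumerate(draft_tokens):
        p_t = p_target[t][token]  # target-model probability of the draft token
        q_t = p_draft[t][token]   # draft-model probability of the same token
        if rng.random() < min(1.0, p_t / q_t):
            accepted += 1
        else:
            break  # reject and stop; remaining speculated tokens are discarded
    return accepted

# Toy example: 2-token vocabulary, speculation length gamma = 3.
rng = random.Random(0)
draft_tokens = [0, 1, 0]
p_draft = [{0: 0.6, 1: 0.4}] * 3    # draft (quantized) model distributions
p_target = [{0: 0.9, 1: 0.1}] * 3   # target (full-precision) model distributions
n = speculative_accept(draft_tokens, p_draft, p_target, rng)
print(f"accepted {n} of {len(draft_tokens)} draft tokens")
```

In self-speculative setups like the one described here, the draft and target models share weights, so the two distributions come from the same network run at different precisions.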
Researcher Affiliation Collaboration 1UC Berkeley, 2Apple, 3ICSI, 4LBNL. Correspondence to: Amir Gholami <EMAIL>.
Pseudocode Yes Algorithm 1 QuantSpec Algorithm
Open Source Code No The paper does not explicitly state that the code for the described methodology is open-source, nor does it provide a link to a code repository.
Open Datasets Yes We evaluate QuantSpec using long-context variants of LLaMA-2 and LWM models as target models. For benchmarking decoding speedup, we use PG-19 (Rae et al., 2019) (an open-vocabulary language modeling benchmark derived from books) and two long-context summarization datasets, namely ∞BENCH Sum (Zhang et al., 2024c; Yen et al., 2024) and Multi-LexSum (Shen et al., 2022; Yen et al., 2024). More details about the datasets are provided in Appendix F. Following Sadhukhan et al. (2024), we compare against two recent sparse-KV-based self-speculative decoding baselines: StreamingLLM (Sadhukhan et al., 2024; Xiao et al., 2023a) and SnapKV (Sadhukhan et al., 2024; Li et al., 2024a).
Dataset Splits No The paper mentions evaluating performance across different context lengths and for 10 different examples, but it does not specify explicit training, validation, or test dataset splits for the experiments conducted using PG-19, ∞BENCH Sum, or Multi-LexSum. It refers to 'optimal speculation length γ... determined through a hyperparameter search for each dataset-model pair' but does not detail dataset splits.
Hardware Specification Yes All experiments are performed on a node equipped with 8 NVIDIA RTX A6000 GPUs.
Software Dependencies No The paper mentions using 'custom CUDA kernels' and refers to 'FlashAttention (Dao et al., 2022)' and 'Flash-Decoding (Dao et al., 2023)', but it does not specify version numbers for CUDA, FlashAttention/Flash-Decoding libraries, or any other software dependencies like PyTorch, Python, etc., required for reproducibility.
Experiment Setup Yes We fix the quantization group size at 128, the residual length R for the KV cache at 256, and limit the number of output tokens to 90. The optimal speculation length γ for each dataset is determined through a hyperparameter search for each dataset-model pair. Details of the hyperparameter search are provided in Appendix G.
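The setup above fixes a quantization group size of 128 and a residual length R = 256 for the KV cache, i.e. the most recent R tokens stay unquantized. The paper's quantizer and kernels are not reproduced here; the sketch below is a minimal illustration assuming plain asymmetric uniform 4-bit quantization with one (scale, zero-point) pair per group, with all function names and toy tensor sizes chosen for illustration only.

```python
import numpy as np

GROUP = 128   # quantization group size (from the setup above)
R = 256       # residual window: most recent R tokens kept in full precision
BITS = 4      # illustrative bit width; not stated in this excerpt

def quantize_groups(x, bits=BITS):
    """Asymmetric uniform quantization with one (scale, zero) pair per
    contiguous group of GROUP values."""
    x = x.reshape(-1, GROUP)
    lo = x.min(axis=1, keepdims=True)
    hi = x.max(axis=1, keepdims=True)
    scale = (hi - lo) / (2**bits - 1)
    scale[scale == 0] = 1.0  # guard against constant groups
    q = np.round((x - lo) / scale).astype(np.uint8)
    return q, scale, lo

def dequantize_groups(q, scale, lo):
    return q.astype(np.float32) * scale + lo

# Toy KV slice: seq_len tokens x head_dim channels.
rng = np.random.default_rng(0)
seq_len, head_dim = 1024, 128
kv = rng.standard_normal((seq_len, head_dim)).astype(np.float32)

frozen, residual = kv[:-R], kv[-R:]   # residual window stays full precision
q, scale, lo = quantize_groups(frozen)
recon = dequantize_groups(q, scale, lo).reshape(frozen.shape)
err = np.abs(recon - frozen).max()
print(f"max abs dequant error: {err:.4f}")
```

The residual window keeps the freshest entries exact, so quantization error only affects older tokens, which is one common motivation for hybrid full-precision/quantized KV-cache layouts.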