Block Verification Accelerates Speculative Decoding
Authors: Ziteng Sun, Uri Mendlovic, Yaniv Leviathan, Asaf Aharoni, Jae H Ro, Ahmad Beirami, Ananda Theertha Suresh
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically test block verification and compare it with the standard token verification on a range of tasks and datasets. We show that our algorithm consistently improves block efficiency (i.e. the expected number of generated tokens) by 7%-10% and overall empirical wall-clock time by 5%-8% (see Table 1). |
| Researcher Affiliation | Industry | Ziteng Sun, Uri Mendlovic, Yaniv Leviathan, Asaf Aharoni, Jae Hun Ro, Ahmad Beirami, Ananda Theertha Suresh (all Google Research; all emails @google.com). |
| Pseudocode | Yes | See Algorithm 2 for a sketch implementation of block verification, and Algorithm 1 for a sketch implementation of the standard token verification for comparison. Note that the implementations follow the same overall structure (the differences are highlighted). See Algorithm 3 for the outer loop of the speculative decoding algorithm, which remains unchanged for both verification methods. See Appendix A for sketch Python implementations. |
| Open Source Code | No | In this section we provide a sketch implementation of block verification (Algorithm 2) in Python. Note that these are meant for illustration purposes only and are not fit for practical use. |
| Open Datasets | Yes | For the experiments on PALM-2 models... We evaluate on prompts from a wide range of datasets and tasks, including language modeling with the one-billion language benchmark (LM1B) (Chelba et al., 2013), ChatGPT prompts sourced from Learn GPT (GPT Prompt) (Rashad, 2023), reasoning questions (Web QA) (Berant et al., 2013), physical commonsense reasoning questions (PIQA) (Bisk et al., 2020), scraped conversations with ChatGPT (ShareGPT) (Rashad, 2023; Ryoko AI, 2023), summarization tasks (XSum) (Narayan et al., 2018), grade school math problems (GSM8K) (Cobbe et al., 2021), and German to English translation (WMT De En) (Bojar et al., 2014). For the Vicuna family of models (Chiang et al., 2023), we conduct the set of experiments in Spec-Bench (Xia et al., 2024). |
| Dataset Splits | No | The paper mentions using prompts from various datasets and decoding the first 1000 prompts, but it does not specify how these datasets themselves were split into training, validation, or test sets for the purpose of model training or evaluation in a way that allows reproduction of data partitioning. |
| Hardware Specification | Yes | For all experiments in this section, we use a single NVIDIA H100 GPU with a batch size of 1 and a max generation length of 1024. |
| Software Dependencies | No | The paper does not provide specific version numbers for any software dependencies or libraries used in the experimental setup. |
| Experiment Setup | Yes | For the experiments on PALM-2 models, we use PALM-2-S as the large target model and PALM-2-XXS / PALM-2-XXXS as the small drafter model. For all datasets, we decode the first 1000 prompts using a max input prompt length of 512 and decode up to 128 output tokens. We use a batch size of 1 in all experiments... We use a temperature of 1.0 for the experiments on PALM-2 models. For the Vicuna family of models... We use Vicuna-7B-v1.3 as the target model and Vicuna-68M as the draft model. To study the effect of temperature, we consider temperatures in {0.2, 0.6, 1.0} and fix γ = 8. |
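The paper's own Python sketches of block verification live in its Appendix A and are not released as open-source code. As a rough, independent illustration of the baseline that block verification is compared against (Algorithm 1 in the paper), here is a minimal sketch of standard token-level verification for speculative decoding: accept each draft token with probability min(1, p(x)/q(x)), and on the first rejection resample from the normalized residual max(0, p − q). The function name and toy distributions below are our own, not from the paper.

```python
import numpy as np

def token_verification(draft_tokens, q_probs, p_probs, rng):
    """Standard token-level verification for speculative decoding (sketch).

    draft_tokens: gamma token ids sampled autoregressively from the drafter.
    q_probs[i]: drafter's distribution over the vocab at draft step i.
    p_probs[i]: target model's distribution at step i (length gamma + 1,
                so a bonus token can be sampled if all drafts are accepted).
    Returns the accepted prefix plus one corrected or bonus token.
    """
    out = []
    for i, x in enumerate(draft_tokens):
        p, q = p_probs[i], q_probs[i]
        # Accept draft token x with probability min(1, p(x)/q(x)).
        if rng.random() < min(1.0, p[x] / q[x]):
            out.append(x)
        else:
            # Rejected: resample from the residual max(0, p - q), normalized.
            residual = np.maximum(p - q, 0.0)
            residual /= residual.sum()
            out.append(int(rng.choice(len(p), p=residual)))
            return out
    # All gamma draft tokens accepted: sample a bonus token from the target.
    bonus = p_probs[len(draft_tokens)]
    out.append(int(rng.choice(len(bonus), p=bonus)))
    return out
```

This baseline verifies each position independently; the paper's block verification instead accepts or rejects the drafted block jointly, which is what yields the reported 7%-10% block-efficiency gain.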