Block Verification Accelerates Speculative Decoding

Authors: Ziteng Sun, Uri Mendlovic, Yaniv Leviathan, Asaf Aharoni, Jae H Ro, Ahmad Beirami, Ananda Theertha Suresh

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We empirically test block verification and compare it with the standard token verification on a range of tasks and datasets. We show that our algorithm consistently improves block efficiency (i.e., the expected number of generated tokens) by 7%-10% and overall empirical wall-clock time by 5%-8% (see Table 1).
Researcher Affiliation | Industry | All authors are affiliated with Google Research: Ziteng Sun (zitengsun@google.com), Uri Mendlovic (urimend@google.com), Yaniv Leviathan (leviathan@google.com), Asaf Aharoni (asafaharoni@google.com), Jae Hun Ro (jaero@google.com), Ahmad Beirami (beirami@google.com), Ananda Theertha Suresh (theertha@google.com).
Pseudocode | Yes | See Algorithm 2 for a sketch implementation of block verification, and Algorithm 1 for a sketch implementation of the standard token verification for comparison. Note that the implementations follow the same overall structure (the differences are highlighted). See Algorithm 3 for the outer loop of the speculative decoding algorithm, which remains unchanged for both verification methods. See Appendix A for sketch Python implementations.
Open Source Code | No | In this section we provide a sketch implementation of block verification (Algorithm 2) in Python. Note that these are meant for illustration purposes only and are not fit for practical use.
Open Datasets | Yes | For the experiments on PALM-2 models... We evaluate on prompts from a wide range of datasets and tasks, including language modeling with the one-billion-word language benchmark (LM1B) (Chelba et al., 2013), ChatGPT prompts sourced from LearnGPT (GPT Prompt) (Rashad, 2023), reasoning questions (WebQA) (Berant et al., 2013), physical commonsense reasoning questions (PIQA) (Bisk et al., 2020), scraped conversations with ChatGPT (ShareGPT) (Rashad, 2023; Ryoko AI, 2023), summarization tasks (XSum) (Narayan et al., 2018), grade-school math problems (GSM8K) (Cobbe et al., 2021), and German-to-English translation (WMT De-En) (Bojar et al., 2014). For the Vicuna family of models (Chiang et al., 2023), we conduct the set of experiments in Spec-Bench (Xia et al., 2024).
Dataset Splits | No | The paper mentions decoding the first 1000 prompts from each dataset, but it does not specify how the datasets were split into training, validation, or test sets, so the data partitioning cannot be reproduced.
Hardware Specification | Yes | For all experiments in this section, we use a single NVIDIA H100 GPU with a batch size of 1 and a max generation length of 1024.
Software Dependencies | No | The paper does not provide specific version numbers for any software dependencies or libraries used in the experimental setup.
Experiment Setup | Yes | For the experiments on PALM-2 models, we use PALM-2-S as the large target model and PALM-2-XXS / PALM-2-XXXS as the small drafter model. For all datasets, we decode the first 1000 prompts using a max input prompt length of 512 and decode up to 128 output tokens. We use a batch size of 1 in all experiments... We use a temperature of 1.0 for the experiments on PALM-2 models. For the Vicuna family of models... We use Vicuna-7B-v1.3 as the target model and Vicuna-68M as the draft model. To study the effect of temperature, we consider temperatures in {0.2, 0.6, 1.0} and fix γ = 8.
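The headline metric above, block efficiency, is the expected number of tokens generated per serial call to the target model. A toy computation makes this concrete (the helper name is ours, not the paper's):

```python
def block_efficiency(tokens_per_call):
    """Mean number of tokens emitted per serial target-model call.

    Each verification call processes one drafted block and always
    yields at least one token (a corrected or bonus token), so the
    value is always >= 1.
    """
    return sum(tokens_per_call) / len(tokens_per_call)

# e.g. three target-model calls that yielded 5, 3, and 4 tokens:
print(block_efficiency([5, 3, 4]))  # 4.0
```

A 7%-10% improvement in this quantity, as reported in Table 1, means each target-model call accepts that many more drafted tokens on average.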
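For readers without access to the paper's algorithms, the baseline that block verification is compared against (the paper's Algorithm 1, standard token-by-token verification in speculative decoding) can be sketched as follows. This is an illustrative toy with our own names and toy distributions, not the paper's implementation: each drafted token is accepted independently with probability min(1, p/q), and the first rejection is replaced by a sample from the normalized residual.

```python
import numpy as np

def token_verify(p, q, drafted, rng):
    """Standard token-by-token verification (sketch, not the paper's code).

    p: (gamma + 1, V) array of target-model distributions per position
    q: (gamma, V) array of draft-model distributions per position
    drafted: list of gamma token ids sampled from the draft model
    Returns the accepted prefix plus one corrected/bonus token.
    """
    out = []
    for i, x in enumerate(drafted):
        # accept the drafted token with probability min(1, p/q)
        if rng.random() < min(1.0, p[i][x] / q[i][x]):
            out.append(x)
        else:
            # reject: resample from the normalized residual max(p - q, 0)
            residual = np.maximum(p[i] - q[i], 0.0)
            residual = residual / residual.sum()
            out.append(int(rng.choice(len(residual), p=residual)))
            return out
    # all gamma tokens accepted: emit one bonus token from the target model
    out.append(int(rng.choice(p.shape[1], p=p[len(drafted)])))
    return out

rng = np.random.default_rng(0)
gamma, vocab = 3, 4
p = np.full((gamma + 1, vocab), 0.25)  # toy target distributions
q = np.full((gamma, vocab), 0.25)      # identical draft distributions
print(token_verify(p, q, [0, 1, 2], rng))  # all drafts accepted + 1 bonus token
```

Block verification differs by accepting or rejecting the drafted block jointly rather than token by token, which is what yields the 7%-10% block-efficiency gain the response above reports.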