Block Verification Accelerates Speculative Decoding
Authors: Ziteng Sun, Uri Mendlovic, Yaniv Leviathan, Asaf Aharoni, Jae H Ro, Ahmad Beirami, Ananda Theertha Suresh
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically test block verification and compare it with the standard token verification on a range of tasks and datasets. We show that our algorithm consistently improves block efficiency (i.e. the expected number of generated tokens) by 7%-10% and overall empirical wall-clock time by 5%-8% (see Table 1). |
| Researcher Affiliation | Industry | Ziteng Sun, Uri Mendlovic, Yaniv Leviathan, Asaf Aharoni, Jae Hun Ro, Ahmad Beirami, Ananda Theertha Suresh (all Google Research; all emails @google.com). |
| Pseudocode | Yes | See Algorithm 2 for a sketch implementation of block verification, and Algorithm 1 for a sketch implementation of the standard token verification for comparison. Note that the implementations follow the same overall structure (the differences are highlighted). See Algorithm 3 for the outer loop of the speculative decoding algorithm, which remains unchanged for both verification methods. See Appendix A for sketch Python implementations. |
| Open Source Code | No | In this section we provide a sketch implementation of block verification (Algorithm 2) in Python. Note that these are meant for illustration purposes only and are not fit for practical use. |
| Open Datasets | Yes | For the experiments on PALM-2 models... We evaluate on prompts from a wide range of datasets and tasks, including language modeling with the one-billion language benchmark (LM1B) (Chelba et al., 2013), ChatGPT prompts sourced from Learn GPT (GPT Prompt) (Rashad, 2023), reasoning questions (Web QA) (Berant et al., 2013), physical commonsense reasoning questions (PIQA) (Bisk et al., 2020), scraped conversations with ChatGPT (ShareGPT) (Rashad, 2023; Ryoko AI, 2023), summarization tasks (XSum) (Narayan et al., 2018), grade school math problems (GSM8K) (Cobbe et al., 2021), and German to English translation (WMT De En) (Bojar et al., 2014). For the Vicuna family of models (Chiang et al., 2023), we conduct the set of experiments in Spec-Bench (Xia et al., 2024). |
| Dataset Splits | No | The paper mentions using prompts from various datasets and decoding the first 1000 prompts, but it does not specify how these datasets themselves were split into training, validation, or test sets for the purpose of model training or evaluation in a way that allows reproduction of data partitioning. |
| Hardware Specification | Yes | For all experiments in this section, we use a single NVIDIA H100 GPU with a batch size of 1 and a max generation length of 1024. |
| Software Dependencies | No | The paper does not provide specific version numbers for any software dependencies or libraries used in the experimental setup. |
| Experiment Setup | Yes | For the experiments on PALM-2 models, we use PALM-2-S as the large target model and PALM-2-XXS / PALM-2-XXXS as the small drafter model. For all datasets, we decode the first 1000 prompts using a max input prompt length of 512 and decode up to 128 output tokens. We use a batch size of 1 in all experiments... We use a temperature of 1.0 for the experiments on PALM-2 models. For the Vicuna family of models... We use Vicuna-7B-v1.3 as the target model and Vicuna-68M as the draft model. To study the effect of temperature, we consider temperatures in {0.2, 0.6, 1.0} and fix γ = 8. |
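The paper's own Python sketches of block verification live in its Appendix A and are not released as open-source code. As a rough, independent illustration of the baseline that block verification is compared against (Algorithm 1 in the paper), here is a minimal sketch of standard token-level verification for speculative decoding: accept each draft token with probability min(1, p(x)/q(x)), and on the first rejection resample from the normalized residual max(0, p − q). The function name and toy distributions below are our own, not from the paper.

```python
import numpy as np

def token_verification(draft_tokens, q_probs, p_probs, rng):
    """Standard token-level verification for speculative decoding (sketch).

    draft_tokens: gamma token ids sampled autoregressively from the drafter.
    q_probs[i]: drafter's distribution over the vocab at draft step i.
    p_probs[i]: target model's distribution at step i (length gamma + 1,
                so a bonus token can be sampled if all drafts are accepted).
    Returns the accepted prefix plus one corrected or bonus token.
    """
    out = []
    for i, x in enumerate(draft_tokens):
        p, q = p_probs[i], q_probs[i]
        # Accept draft token x with probability min(1, p(x)/q(x)).
        if rng.random() < min(1.0, p[x] / q[x]):
            out.append(x)
        else:
            # Rejected: resample from the residual max(0, p - q), normalized.
            residual = np.maximum(p - q, 0.0)
            residual /= residual.sum()
            out.append(int(rng.choice(len(p), p=residual)))
            return out
    # All gamma draft tokens accepted: sample a bonus token from the target.
    bonus = p_probs[len(draft_tokens)]
    out.append(int(rng.choice(len(bonus), p=bonus)))
    return out
```

This baseline verifies each position independently; the paper's block verification instead accepts or rejects the drafted block jointly, which is what yields the reported 7%-10% block-efficiency gain.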