Towards Optimal Multi-draft Speculative Decoding

Authors: Zhengmian Hu, Tong Zheng, Vignesh Viswanathan, Ziyi Chen, Ryan Rossi, Yihan Wu, Dinesh Manocha, Heng Huang

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental The goal of our experiments is to measure the acceptance rates of various MDSD methods on real text distributions and compare them with the theoretical upper bounds. In the previous sections, we analyzed the theoretical acceptance rate α(p, pdraft) for three different draft distributions: sampling with replacement, sampling without replacement, and the greedy approach (Section 5). We also discussed some existing verification methods (Appendix A), such as RRS and K-SEQ, whose acceptance rates are expected to be lower than the theoretical upper bound. For K-SEQ, its average acceptance rate αK-SEQ can be derived theoretically (see Appendix A.2 for details). Our efficient computation methods (Section 4) make it possible, for the first time, to obtain the theoretical acceptance rate upper bound of MDSD for vocabulary sizes in the thousands. To obtain realistic distributions p and pdraft, we select real-world datasets for various tasks, including Alpaca (Taori et al., 2023) for instruction following, WMT14 De-En (Bojar et al., 2014) for translation, and CNN-Daily Mail (Hermann et al., 2015) for summarization. For each task, we use an LLM to generate responses on 1024 data samples, with a maximum length of 128 tokens. We then measure the logits of the target model and the draft model on these generated responses to construct p and pdraft. We evaluate different approaches based on four publicly available large language models: 1) LLaMA (Touvron et al., 2023); 2) Vicuna (Chiang et al., 2023), the instruction fine-tuned version of the LLaMA models; 3) OPT (Zhang et al., 2022); and 4) Qwen2 (Yang et al., 2024a). Specifically, for the LLaMA family, we select LLaMA-7B as the target model and LLaMA-68M as the draft model, consistent with previous work (Miao et al., 2024). For the OPT family, we select OPT-6.7B as the target model and OPT-125M as the draft model.
Moreover, for the Vicuna family and the Qwen2 family, we select Vicuna-7B-v1.3 and Qwen2-7B-Instruct as target models, and we use the paired draft models provided by EAGLE (Li et al., 2024), with 0.24B and 0.26B parameters, respectively. Unless otherwise specified, we use a default generation temperature of 0.7 and a draft token number of 3. The total computational cost is less than 50 GPU hours on an RTX A6000. In the main experiment, we compare the acceptance rates of different MDSD methods across various LLMs and tasks. The results are shown in Table 1.
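To make the acceptance-rate quantity described above concrete: in the single-draft baseline case, the expected acceptance rate between a target distribution p and a draft distribution pdraft is the total overlap Σ_x min(p(x), pdraft(x)). The sketch below is an illustration of that baseline quantity only, not the paper's multi-draft computation; the function name is an assumption.

```python
import numpy as np

def single_draft_acceptance_rate(p, p_draft):
    """Expected acceptance rate of standard (single-draft) speculative
    sampling: sum over the vocabulary of min(p(x), p_draft(x)).
    Assumes p and p_draft are valid probability vectors over the same
    vocabulary (an assumption of this sketch)."""
    p = np.asarray(p, dtype=np.float64)
    q = np.asarray(p_draft, dtype=np.float64)
    return float(np.minimum(p, q).sum())

# Toy 4-token vocabulary: overlap is 0.4 + 0.2 + 0.1 + 0.1 = 0.8
p = [0.5, 0.2, 0.2, 0.1]
q = [0.4, 0.4, 0.1, 0.1]
print(single_draft_acceptance_rate(p, q))
```

The multi-draft upper bound α(p, pdraft) studied in the paper generalizes this overlap to several draft tokens and is harder to compute, which is why the paper's efficient methods (Section 4) are needed for vocabulary sizes in the thousands.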
Researcher Affiliation Collaboration Zhengmian Hu1,2, Tong Zheng1, Vignesh Viswanathan2,3, Ziyi Chen1, Ryan A. Rossi2, Yihan Wu1, Dinesha Manocha1,4, Heng Huang1 1Department of Computer Science, University of Maryland, College Park, MD, USA 2Adobe Research, San Jose, CA, USA 3Manning College of Information & Computer Sciences, University of Massachusetts Amherst, MA, USA 4Department of Electrical and Computer Engineering, University of Maryland, College Park, MD, USA
Pseudocode Yes Pseudocode for applying multi-draft speculative sampling over multiple steps, with an arbitrary tree topology: def multidraft_speculative_decoding(prompt, tree_topology, draft_model, target_model): """Multi-Draft Speculative Decoding algorithm for accelerating language model inference."""
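The pseudocode row above reproduces only the function signature from the paper. For context, the verification step that such algorithms build on, in the standard single-draft setting, is the accept/reject rule of speculative sampling. The sketch below shows that standard rule only; it is not the paper's multi-draft verification, and the function name and structure are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def verify_token(token, p, p_draft):
    """Standard single-draft speculative-sampling verification for one
    drafted token. Assumes p_draft[token] > 0 (the token was sampled
    from p_draft). Returns (emitted_token, was_accepted)."""
    p = np.asarray(p, dtype=np.float64)
    q = np.asarray(p_draft, dtype=np.float64)
    # Accept the drafted token with probability min(1, p[token] / q[token]).
    if rng.random() < min(1.0, p[token] / q[token]):
        return token, True
    # Rejected: resample from the normalized residual max(p - q, 0),
    # which keeps the emitted token exactly distributed as p.
    residual = np.maximum(p - q, 0.0)
    residual /= residual.sum()
    return int(rng.choice(len(p), p=residual)), False
```

Multi-draft methods such as RRS generalize this by attempting several drafted tokens in turn, updating the residual after each rejection; the theoretical upper bound α(p, pdraft) studied in the paper bounds what any such verification scheme can achieve.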
Open Source Code No The paper does not contain any explicit statement about releasing code nor provides any links to a code repository. It mentions implementing the Greedy method within the EAGLE Framework but does not state that their specific implementation is open-source or available.
Open Datasets Yes To obtain realistic distributions p and pdraft, we select real-world datasets for various tasks, including Alpaca (Taori et al., 2023) for instruction following, WMT14 De-En (Bojar et al., 2014) for translation, and CNN-Daily Mail (Hermann et al., 2015) for summarization.
Dataset Splits No The paper mentions generating responses on '1024 data samples' and using the 'MT-Bench dataset' but does not specify any training, validation, or test splits for any of the datasets used in the experiments.
Hardware Specification Yes The total computational cost is less than 50 GPU hours on an RTX A6000.
Software Dependencies No The paper does not specify any software dependencies with version numbers.
Experiment Setup Yes Unless otherwise specified, we use a default generation temperature of 0.7 and a draft token number of 3. ... We experiment with three types of tree structures: (1) drafts = 2, depths = 4; (2) drafts = 4, depths = 3; and (3) a sparse tree with up to 4 drafts and 5 steps, which is the default setting in EAGLE.
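For a sense of scale of the tree configurations quoted above: if the first two configurations are full k-ary trees (an assumption of this sketch; the third, sparse EAGLE tree does not follow this formula), the number of draft tokens per step is the geometric sum k + k² + … + k^depth.

```python
def full_tree_tokens(drafts, depth):
    """Draft tokens in a full tree with branching factor `drafts` and
    `depth` levels: drafts + drafts**2 + ... + drafts**depth.
    Assumes a complete k-ary tree, which may not match a sparse tree."""
    return sum(drafts ** d for d in range(1, depth + 1))

print(full_tree_tokens(2, 4))  # 30 draft tokens for drafts=2, depth=4
print(full_tree_tokens(4, 3))  # 84 draft tokens for drafts=4, depth=3
```

This illustrates why the default EAGLE configuration uses a sparse tree: a full tree with 4 drafts and 5 steps would already require 1364 draft tokens per verification step.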