Towards Optimal Multi-draft Speculative Decoding

Authors: Zhengmian Hu, Tong Zheng, Vignesh Viswanathan, Ziyi Chen, Ryan Rossi, Yihan Wu, Dinesh Manocha, Heng Huang

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental The goal of our experiments is to measure the acceptance rates of various MDSD methods on real text distributions and compare them with the theoretical upper bounds. In the previous sections, we analyzed the theoretical acceptance rate α(p, pdraft) for three different draft distributions: sampling with replacement, sampling without replacement, and the greedy approach (Section 5). We also discussed some existing verification methods (Appendix A), such as RRS and K-SEQ, whose acceptance rates are expected to be lower than the theoretical upper bound. For K-SEQ, its average acceptance rate αK-SEQ can be derived theoretically (see Appendix A.2 for details). Our efficient computation methods (Section 4) make it possible, for the first time, to obtain the theoretical acceptance rate upper bound of MDSD for vocabulary sizes in the thousands. To obtain realistic distributions p and pdraft, we select real-world datasets for various tasks, including Alpaca (Taori et al., 2023) for instruction following, WMT14 De-En (Bojar et al., 2014) for translation, and CNN-Daily Mail (Hermann et al., 2015) for summarization. For each task, we use an LLM to generate responses on 1024 data samples, with a maximum length of 128 tokens. We then measure the logits of the target model and the draft model on these generated responses to construct p and pdraft. We evaluate different approaches based on four publicly available large language models: 1) LLaMA (Touvron et al., 2023); 2) Vicuna (Chiang et al., 2023), the instruction fine-tuned version of the LLaMA models; 3) OPT (Zhang et al., 2022); and 4) Qwen2 (Yang et al., 2024a). Specifically, for the LLaMA family, we select LLaMA-7B as the target model and LLaMA-68M as the draft model, consistent with previous work (Miao et al., 2024). For the OPT family, we select OPT-6.7B as the target model and OPT-125M as the draft model.
Moreover, for the Vicuna family and the Qwen2 family, we select Vicuna-7B-v1.3 and Qwen2-7B-Instruct as target models, and we use the paired draft models provided by EAGLE (Li et al., 2024), with 0.24B and 0.26B parameters, respectively. Unless otherwise specified, we use a default generation temperature of 0.7 and a draft token number of 3. The total computational cost is less than 50 GPU hours on an RTX A6000. In the main experiment, we compare the acceptance rates of different MDSD methods across various LLMs and tasks. The results are shown in Table 1.
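To make the acceptance-rate quantity described above concrete: in the single-draft baseline case, the expected acceptance rate between a target distribution p and a draft distribution pdraft is the total overlap Σ_x min(p(x), pdraft(x)). The sketch below is an illustration of that baseline quantity only, not the paper's multi-draft computation; the function name is an assumption.

```python
import numpy as np

def single_draft_acceptance_rate(p, p_draft):
    """Expected acceptance rate of standard (single-draft) speculative
    sampling: sum over the vocabulary of min(p(x), p_draft(x)).
    Assumes p and p_draft are valid probability vectors over the same
    vocabulary (an assumption of this sketch)."""
    p = np.asarray(p, dtype=np.float64)
    q = np.asarray(p_draft, dtype=np.float64)
    return float(np.minimum(p, q).sum())

# Toy 4-token vocabulary: overlap is 0.4 + 0.2 + 0.1 + 0.1 = 0.8
p = [0.5, 0.2, 0.2, 0.1]
q = [0.4, 0.4, 0.1, 0.1]
print(single_draft_acceptance_rate(p, q))
```

The multi-draft upper bound α(p, pdraft) studied in the paper generalizes this overlap to several draft tokens and is harder to compute, which is why the paper's efficient methods (Section 4) are needed for vocabulary sizes in the thousands.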
Researcher Affiliation Collaboration Zhengmian Hu1,2, Tong Zheng1, Vignesh Viswanathan2,3, Ziyi Chen1, Ryan A. Rossi2, Yihan Wu1, Dinesha Manocha1,4, Heng Huang1 1Department of Computer Science, University of Maryland, College Park, MD, USA 2Adobe Research, San Jose, CA, USA 3Manning College of Information & Computer Sciences, University of Massachusetts Amherst, MA, USA 4Department of Electrical and Computer Engineering, University of Maryland, College Park, MD, USA
Pseudocode Yes Pseudocode for applying multi-draft speculative sampling over multiple steps, with an arbitrary tree topology: def multidraft_speculative_decoding(prompt, tree_topology, draft_model, target_model): """Multi-Draft Speculative Decoding algorithm for accelerating language model inference."""
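The pseudocode row above reproduces only the function signature from the paper. For context, the verification step that such algorithms build on, in the standard single-draft setting, is the accept/reject rule of speculative sampling. The sketch below shows that standard rule only; it is not the paper's multi-draft verification, and the function name and structure are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def verify_token(token, p, p_draft):
    """Standard single-draft speculative-sampling verification for one
    drafted token. Assumes p_draft[token] > 0 (the token was sampled
    from p_draft). Returns (emitted_token, was_accepted)."""
    p = np.asarray(p, dtype=np.float64)
    q = np.asarray(p_draft, dtype=np.float64)
    # Accept the drafted token with probability min(1, p[token] / q[token]).
    if rng.random() < min(1.0, p[token] / q[token]):
        return token, True
    # Rejected: resample from the normalized residual max(p - q, 0),
    # which keeps the emitted token exactly distributed as p.
    residual = np.maximum(p - q, 0.0)
    residual /= residual.sum()
    return int(rng.choice(len(p), p=residual)), False
```

Multi-draft methods such as RRS generalize this by attempting several drafted tokens in turn, updating the residual after each rejection; the theoretical upper bound α(p, pdraft) studied in the paper bounds what any such verification scheme can achieve.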
Open Source Code No The paper does not contain any explicit statement about releasing code nor provides any links to a code repository. It mentions implementing the Greedy method within the EAGLE Framework but does not state that their specific implementation is open-source or available.
Open Datasets Yes To obtain realistic distributions p and pdraft, we select real-world datasets for various tasks, including Alpaca (Taori et al., 2023) for instruction following, WMT14 De-En (Bojar et al., 2014) for translation, and CNN-Daily Mail (Hermann et al., 2015) for summarization.
Dataset Splits No The paper mentions generating responses on '1024 data samples' and using the 'MT-Bench dataset' but does not specify any training, validation, or test splits for any of the datasets used in the experiments.
Hardware Specification Yes The total computational cost is less than 50 GPU hours on an RTX A6000.
Software Dependencies No The paper does not specify any software dependencies with version numbers.
Experiment Setup Yes Unless otherwise specified, we use a default generation temperature of 0.7 and a draft token number of 3. ... We experiment with three types of tree structures: (1) drafts = 2, depths = 4; (2) drafts = 4, depths = 3; and (3) a sparse tree with up to 4 drafts and 5 steps, which is the default setting in EAGLE.
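For a sense of scale of the tree configurations quoted above: if the first two configurations are full k-ary trees (an assumption of this sketch; the third, sparse EAGLE tree does not follow this formula), the number of draft tokens per step is the geometric sum k + k² + … + k^depth.

```python
def full_tree_tokens(drafts, depth):
    """Draft tokens in a full tree with branching factor `drafts` and
    `depth` levels: drafts + drafts**2 + ... + drafts**depth.
    Assumes a complete k-ary tree, which may not match a sparse tree."""
    return sum(drafts ** d for d in range(1, depth + 1))

print(full_tree_tokens(2, 4))  # 30 draft tokens for drafts=2, depth=4
print(full_tree_tokens(4, 3))  # 84 draft tokens for drafts=4, depth=3
```

This illustrates why the default EAGLE configuration uses a sparse tree: a full tree with 4 drafts and 5 steps would already require 1364 draft tokens per verification step.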