Dynamic-Width Speculative Beam Decoding for LLM Inference

Authors: Zongyue Qin, Zifan He, Neha Prakriya, Jason Cong, Yizhou Sun

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results show that our approach achieves a 1.5-1.9× speed-up and 1.8-2.5× smaller energy consumption than beam sampling, without sacrificing performance on downstream tasks. Besides, it can produce significantly higher-quality outputs than speculative decoding, while maintaining comparable time, memory, and energy costs.
Researcher Affiliation | Academia | University of California, Los Angeles, CA, USA EMAIL
Pseudocode | Yes | Algorithm 1: Draft and Verification for Speculative Beam Sampling
Open Source Code | Yes | Our code is open source (footnote 1: https://github.com/ZongyueQin/DSBD).
Open Datasets | Yes | We use public datasets: SQuAD (Rajpurkar, Jia, and Liang 2018), Spider (Yu et al. 2018), and MT-Bench (Zheng et al. 2023).
Dataset Splits | No | The paper uses public datasets: SQuAD (Rajpurkar, Jia, and Liang 2018), Spider (Yu et al. 2018), and MT-Bench (Zheng et al. 2023). However, it does not explicitly provide details about training/test/validation dataset splits, percentages, or methodology for these datasets in the main text.
Hardware Specification | No | We use Llama-2-13B, Llama-3.1-8B, and OPT-13B as the large models as they are the largest models our GPU could run.
Software Dependencies | No | The paper mentions various LLM architectures and models like transformer (Vaswani et al. 2017), GPT-4 (Achiam et al. 2023), Llama-3 (AI@Meta 2024), PaLM (Anil et al. 2023), OPT (Zhang et al. 2022), Llama-2 (Touvron et al. 2023), and Llama-68M (Miao et al. 2023). However, it does not provide specific version numbers for software dependencies such as programming languages, deep learning frameworks (e.g., PyTorch, TensorFlow), or other relevant libraries.
Experiment Setup | Yes | The width of beam sampling ranges from 1 to 4. For our method, we vary the draft beam width W_S ∈ {2, 3, 4, 5, 6}, the threshold t ∈ {0.7, 0.9}, and set W_min ∈ {1, 2, 3}.
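
The hyperparameter ranges in the Experiment Setup row can be written out as a small grid. The following is only a sketch: the variable and key names are hypothetical, and it assumes a full Cartesian product over the three reported sets, which the source does not confirm. It is not taken from the authors' repository.

```python
from itertools import product

# Values as reported in the Experiment Setup row.
DRAFT_BEAM_WIDTHS = (2, 3, 4, 5, 6)   # draft beam width (W_S)
THRESHOLDS = (0.7, 0.9)               # threshold (t)
MIN_BEAM_WIDTHS = (1, 2, 3)           # minimum beam width (W_min)
BASELINE_BEAM_WIDTHS = (1, 2, 3, 4)   # beam sampling baseline, widths 1 to 4


def enumerate_configs():
    """Return every (W_S, t, W_min) combination as a dict.

    Assumption: a full Cartesian product. The source lists the value sets
    but does not say whether every combination was actually run.
    """
    return [
        {"draft_beam_width": w_s, "threshold": t, "min_beam_width": w_min}
        for w_s, t, w_min in product(
            DRAFT_BEAM_WIDTHS, THRESHOLDS, MIN_BEAM_WIDTHS
        )
    ]


if __name__ == "__main__":
    configs = enumerate_configs()
    print(len(configs))  # 5 * 2 * 3 = 30 combinations
    print(configs[0])
```

Under that assumption the grid holds 30 settings for the method, compared against four beam sampling baselines.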