Dynamic-Width Speculative Beam Decoding for LLM Inference
Authors: Zongyue Qin, Zifan He, Neha Prakriya, Jason Cong, Yizhou Sun
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results show that our approach achieves a 1.5-1.9× speed-up and 1.8-2.5× smaller energy consumption than beam sampling, without sacrificing performance on downstream tasks. Besides, it can produce significantly higher-quality outputs than speculative decoding, while maintaining comparable time, memory, and energy costs. |
| Researcher Affiliation | Academia | University of California Los Angeles, CA, USA EMAIL |
| Pseudocode | Yes | Algorithm 1: Draft and Verification for Speculative Beam Sampling |
| Open Source Code | Yes | Our code is open source: https://github.com/ZongyueQin/DSBD |
| Open Datasets | Yes | We use public datasets: SQuAD (Rajpurkar, Jia, and Liang 2018), Spider (Yu et al. 2018), and MTBench (Zheng et al. 2023). |
| Dataset Splits | No | The paper uses public datasets: SQuAD (Rajpurkar, Jia, and Liang 2018), Spider (Yu et al. 2018), and MTBench (Zheng et al. 2023). However, it does not explicitly provide details about training/test/validation dataset splits, percentages, or methodology for these datasets in the main text. |
| Hardware Specification | No | We use Llama-2-13B, Llama-3.1-8B, and OPT-13B as the large models as they are the largest models our GPU could run. |
| Software Dependencies | No | The paper mentions various LLM architectures and models like transformer (Vaswani et al. 2017), GPT-4 (Achiam et al. 2023), Llama-3 (AI@Meta 2024), PALM (Anil et al. 2023), OPT (Zhang et al. 2022), Llama-2 (Touvron et al. 2023), and Llama-68M (Miao et al. 2023). However, it does not provide specific version numbers for software dependencies such as programming languages, deep learning frameworks (e.g., PyTorch, TensorFlow), or other relevant libraries. |
| Experiment Setup | Yes | The width of beam sampling ranges from 1 to 4. For our method, we vary the draft beam width W_S ∈ {2, 3, 4, 5, 6}, the threshold t ∈ {0.7, 0.9}, and set W_min ∈ {1, 2, 3}. |
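The table above references the paper's Algorithm 1 (draft and verification for speculative beam sampling) and its hyperparameters (draft beam width W_S, threshold t, minimum width W_min). The sketch below is a toy illustration of that general pattern, not the authors' implementation: a draft model proposes candidate continuations per beam, a target model verifies each via the standard speculative acceptance test min(1, p/q), and the surviving beam width shrinks toward `w_min` once a probability-mass threshold is covered. All function and variable names here are hypothetical, and the toy "models" are plain dictionaries mapping prefixes to next-token distributions.

```python
import random


def speculative_beam_step(beams, q, p, w_draft, w_min, threshold, rng):
    """One hypothetical draft-and-verify step (illustrative sketch only).

    beams:  list of token sequences (tuples)
    q, p:   draft / target next-token distributions, dist[prefix] -> {token: prob}
    """
    # 1. Draft: the small model proposes up to w_draft continuations per beam.
    candidates = []
    for seq in beams:
        dist = q[seq]
        for tok in sorted(dist, key=dist.get, reverse=True)[:w_draft]:
            candidates.append((seq + (tok,), dist[tok]))

    # 2. Verify: accept each candidate with probability min(1, p/q),
    #    the usual speculative-decoding acceptance test.
    accepted = []
    for seq, q_prob in candidates:
        prefix, tok = seq[:-1], seq[-1]
        p_prob = p[prefix].get(tok, 0.0)
        if rng.random() < min(1.0, p_prob / q_prob):
            accepted.append((seq, p_prob))

    # 3. Dynamic width: keep the top beams until `threshold` of the accepted
    #    target-model mass is covered, but never fewer than w_min beams.
    accepted.sort(key=lambda x: x[1], reverse=True)
    total = sum(score for _, score in accepted) or 1.0
    kept, mass = [], 0.0
    for seq, score in accepted:
        kept.append(seq)
        mass += score / total
        if mass >= threshold and len(kept) >= w_min:
            break
    return kept


# Toy usage: draft and target agree, so every candidate is accepted and
# the width shrinks as soon as the mass threshold is met.
dists = {(): {"a": 0.6, "b": 0.4}}
kept = speculative_beam_step(
    beams=[()], q=dists, p=dists,
    w_draft=2, w_min=1, threshold=0.5, rng=random.Random(0),
)
```

With identical draft and target distributions the acceptance probability is always 1, so this run deterministically keeps the single highest-probability beam `("a",)` (its mass 0.6 already exceeds the 0.5 threshold and `w_min=1` is satisfied).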