BanditSpec: Adaptive Speculative Decoding via Bandit Algorithms
Authors: Yunlong Hou, Fengzhuo Zhang, Cunxiao Du, Xuan Zhang, Jiachun Pan, Tianyu Pang, Chao Du, Vincent Tan, Zhuoran Yang
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finally, extensive empirical experiments with LLaMA3 and Qwen2 demonstrate that our algorithms are effective compared to existing methods, and the throughput is close to the oracle best hyperparameter in simulated real-life LLM serving scenarios with diverse input prompts. [...] In this section, we conduct two sets of experiments to demonstrate the efficacy of the proposed bandit framework BANDITSPEC, along with UCBSPEC and EXP3SPEC. |
| Researcher Affiliation | Collaboration | 1National University of Singapore 2Sea AI Lab 3Singapore Management University 4Yale University. |
| Pseudocode | Yes | Algorithm 1 CANONICAL DECODING [...] Algorithm 2 SPECULATIVE DECODING SUBROUTINE (SPECDECSUB) [...] Algorithm 3 SPECULATIVE DECODING WITH BANDITS (BANDITSPEC) [...] Algorithm 4 UCBSPEC [...] Algorithm 5 EXP3SPEC [...] Algorithm 6 Vanilla Speculative Decoding [...] Algorithm 7 Dynamics of MAB [...] Algorithm 8 BANDITSPEC(UCBSPEC) (Full version of UCBSPEC) [...] Algorithm 9 BANDITSPEC(EXP3SPEC) (Full version of EXP3SPEC) [...] Algorithm 10 BANDITSPEC(EXP3SPEC) (For analysis purposes) [...] Algorithm 11 Dynamics of the OLO Problem with Full Information Feedback |
| Open Source Code | Yes | The code is accessible via https://github.com/sail-sg/BanditSpec. |
| Open Datasets | Yes | The experiments are carried out on Spec-Bench (Xia et al., 2024), Alpaca (Taori et al., 2023), CodeEditor (Guo et al., 2024) and DebugBench (Tian et al., 2024). [...] For evaluation, we adopt Alpaca (Taori et al., 2023) as the test set, as it covers various topics, thereby simulating a realistic setting with diverse acceptance rates. |
| Dataset Splits | Yes | To approximate real-world conditions, we randomly sample prompts from the test set to form a batch for inference, with batch sizes ranging from 1 to 50. As our evaluation metric, we measure the throughput improvement relative to the canonical decoding (non-speculative) baseline. Our result is averaged over 16 independent runs to smoothen the hardware-dependent factors. |
| Hardware Specification | Yes | The experiments are conducted on a single A100 and set batch size as 1. [...] The experiments are conducted on a single A100. [...] In the main paper, the experiments are conducted on a single A100 GPU. In this section, we conduct an additional set of experiments on a GeForce RTX 4090. |
| Software Dependencies | No | The paper mentions using specific LLMs like LLaMA3-8B-Instruct and Qwen2-7B-Instruct, and draft models like Eagle-1 and Eagle-2, but it does not specify any software dependencies (e.g., Python, PyTorch, or CUDA versions) with version numbers that would be needed to replicate the experiments. |
| Experiment Setup | Yes | We set the maximum speculation length L as 4, and γ takes values in {0, ..., 4} where γ = 0 corresponds to the canonical decoding (Algorithm 1). [...] The experiments are conducted on a single A100 and set batch size as 1. [...] To approximate real-world conditions, we randomly sample prompts from the test set to form a batch for inference, with batch sizes ranging from 1 to 50. Our result is averaged over 16 independent runs to smoothen the hardware-dependent factors. |
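The quoted setup (maximum speculation length L = 4, with γ ∈ {0, ..., 4} and γ = 0 being canonical decoding) can be illustrated with a minimal UCB bandit simulation over candidate speculation lengths. This is a sketch only, not the paper's UCBSPEC implementation: the acceptance probability `p`, the cost model (`draft_cost`, `verify_cost`), and the toy throughput reward are all hypothetical stand-ins for the hardware-dependent quantities the paper measures.

```python
import math
import random

def ucb_pick(counts, means, t, c=1.0):
    """Return the arm with the highest UCB index; untried arms go first."""
    for k, n in enumerate(counts):
        if n == 0:
            return k
    return max(range(len(counts)),
               key=lambda k: means[k] + c * math.sqrt(math.log(t) / counts[k]))

def simulate_ucb_spec(L=4, p=0.7, draft_cost=0.2, verify_cost=1.0,
                      horizon=5000, seed=0):
    """UCB over speculation lengths gamma in {0, ..., L}.

    Each round drafts gamma tokens; drafted tokens are accepted one by
    one with probability p until the first rejection.  The reward is a
    toy throughput: (accepted tokens + 1 verified token) divided by the
    round's cost (verify_cost + draft_cost * gamma).  gamma = 0 is
    canonical decoding: exactly one token per verification step.
    """
    rng = random.Random(seed)
    counts = [0] * (L + 1)
    means = [0.0] * (L + 1)
    for t in range(1, horizon + 1):
        g = ucb_pick(counts, means, t)
        accepted = 0
        while accepted < g and rng.random() < p:
            accepted += 1
        reward = (accepted + 1) / (verify_cost + draft_cost * g)
        counts[g] += 1
        means[g] += (reward - means[g]) / counts[g]  # running average
    return counts, means
```

Under these toy constants the expected throughput peaks at an interior γ (neither 0 nor L), so the bandit must actually learn the trade-off between drafting more tokens and paying more draft cost per round, mirroring the adaptive hyperparameter selection that BANDITSPEC targets.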