BanditSpec: Adaptive Speculative Decoding via Bandit Algorithms
Authors: Yunlong Hou, Fengzhuo Zhang, Cunxiao Du, Xuan Zhang, Jiachun Pan, Tianyu Pang, Chao Du, Vincent Tan, Zhuoran Yang
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finally, extensive empirical experiments with LLaMA3 and Qwen2 demonstrate that our algorithms are effective compared to existing methods, and the throughput is close to the oracle best hyperparameter in simulated real-life LLM serving scenarios with diverse input prompts. [...] In this section, we conduct two sets of experiments to demonstrate the efficacy of the proposed bandit framework BANDITSPEC, along with UCBSPEC and EXP3SPEC. |
| Researcher Affiliation | Collaboration | 1National University of Singapore 2Sea AI Lab 3Singapore Management University 4Yale University. |
| Pseudocode | Yes | Algorithm 1 CANONICAL DECODING [...] Algorithm 2 SPECULATIVE DECODING SUBROUTINE (SPECDECSUB) [...] Algorithm 3 SPECULATIVE DECODING WITH BANDITS (BANDITSPEC) [...] Algorithm 4 UCBSPEC [...] Algorithm 5 EXP3SPEC [...] Algorithm 6 Vanilla Speculative Decoding [...] Algorithm 7 Dynamics of MAB [...] Algorithm 8 BANDITSPEC(UCBSPEC) (Full version of UCBSPEC) [...] Algorithm 9 BANDITSPEC(EXP3SPEC) (Full version of EXP3SPEC) [...] Algorithm 10 BANDITSPEC(EXP3SPEC) (For analysis purposes) [...] Algorithm 11 Dynamics of the OLO Problem with Full Information Feedback |
| Open Source Code | Yes | The code is accessible via https://github.com/sail-sg/BanditSpec. |
| Open Datasets | Yes | The experiments are carried out on Spec-Bench (Xia et al., 2024), Alpaca (Taori et al., 2023), CodeEditor (Guo et al., 2024) and DebugBench (Tian et al., 2024). [...] For evaluation, we adopt Alpaca (Taori et al., 2023) as the test set, as it covers various topics, thereby simulating a realistic setting with diverse acceptance rates. |
| Dataset Splits | Yes | To approximate real-world conditions, we randomly sample prompts from the test set to form a batch for inference, with batch sizes ranging from 1 to 50. As our evaluation metric, we measure the throughput improvement relative to the canonical decoding (non-speculative) baseline. Our result is averaged over 16 independent runs to smoothen the hardware-dependent factors. |
| Hardware Specification | Yes | The experiments are conducted on a single A100 and set batch size as 1. [...] The experiments are conducted on a single A100. [...] In the main paper, the experiments are conducted on a single A100 GPU. In this section, we conduct an additional set of experiments on a GeForce RTX 4090. |
| Software Dependencies | No | The paper mentions using specific LLMs like LLaMA3-8B-Instruct and Qwen2-7B-Instruct, and draft models like Eagle-1 and Eagle-2, but it does not specify any software dependencies (e.g., Python, PyTorch, or CUDA versions) with version numbers that would be needed to replicate the experiments. |
| Experiment Setup | Yes | We set the maximum speculation length L as 4, and γ takes values in {0, ..., 4} where γ = 0 corresponds to the canonical decoding (Algorithm 1). [...] The experiments are conducted on a single A100 and set batch size as 1. [...] To approximate real-world conditions, we randomly sample prompts from the test set to form a batch for inference, with batch sizes ranging from 1 to 50. Our result is averaged over 16 independent runs to smoothen the hardware-dependent factors. |
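The quoted setup (maximum speculation length L = 4, with γ ∈ {0, ..., 4} and γ = 0 being canonical decoding) can be illustrated with a minimal UCB bandit simulation over candidate speculation lengths. This is a sketch only, not the paper's UCBSPEC implementation: the acceptance probability `p`, the cost model (`draft_cost`, `verify_cost`), and the toy throughput reward are all hypothetical stand-ins for the hardware-dependent quantities the paper measures.

```python
import math
import random

def ucb_pick(counts, means, t, c=1.0):
    """Return the arm with the highest UCB index; untried arms go first."""
    for k, n in enumerate(counts):
        if n == 0:
            return k
    return max(range(len(counts)),
               key=lambda k: means[k] + c * math.sqrt(math.log(t) / counts[k]))

def simulate_ucb_spec(L=4, p=0.7, draft_cost=0.2, verify_cost=1.0,
                      horizon=5000, seed=0):
    """UCB over speculation lengths gamma in {0, ..., L}.

    Each round drafts gamma tokens; drafted tokens are accepted one by
    one with probability p until the first rejection.  The reward is a
    toy throughput: (accepted tokens + 1 verified token) divided by the
    round's cost (verify_cost + draft_cost * gamma).  gamma = 0 is
    canonical decoding: exactly one token per verification step.
    """
    rng = random.Random(seed)
    counts = [0] * (L + 1)
    means = [0.0] * (L + 1)
    for t in range(1, horizon + 1):
        g = ucb_pick(counts, means, t)
        accepted = 0
        while accepted < g and rng.random() < p:
            accepted += 1
        reward = (accepted + 1) / (verify_cost + draft_cost * g)
        counts[g] += 1
        means[g] += (reward - means[g]) / counts[g]  # running average
    return counts, means
```

Under these toy constants the expected throughput peaks at an interior γ (neither 0 nor L), so the bandit must actually learn the trade-off between drafting more tokens and paying more draft cost per round, mirroring the adaptive hyperparameter selection that BANDITSPEC targets.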