Polybasic Speculative Decoding Through a Theoretical Perspective
Authors: Ruilin Wang, Huixia Li, Yuexiao Ma, Xiawu Zheng, Fei Chao, Xuefeng Xiao, Rongrong Ji
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results across multiple model families demonstrate that our approach yields speedup ratios ranging from 3.31 to 4.01 for LLaMA2-Chat 7B, up to 3.87 for LLaMA3-8B, up to 4.43 for Vicuna-7B, and up to 3.85 for Qwen2-7B, all while preserving the original output distribution. We release our theoretical proofs and implementation code to facilitate further investigation into polybasic speculative decoding. |
| Researcher Affiliation | Collaboration | 1Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, P.R. China 2ByteDance 3Institute of Artificial Intelligence, Xiamen University 4Peng Cheng Laboratory, Shenzhen, China. |
| Pseudocode | Yes | Algorithm 1 Polybasic Speculative Decoding (Three Models) |
| Open Source Code | Yes | We release our theoretical proofs and implementation code to facilitate further investigation into polybasic speculative decoding. |
| Open Datasets | Yes | We evaluated our multi-model speculative system on SpecBench (Xia et al., 2024), across multiple tasks including multi-turn conversation, translation, summarization, question answering, mathematical reasoning, and retrieval-augmented generation, employing MT-bench (Zheng et al., 2023), WMT14 DE-EN, CNN/Daily Mail (Nallapati et al., 2016), Natural Questions (Kwiatkowski et al., 2019), GSM8K (Cobbe et al., 2021), and DPR (Karpukhin et al., 2020). |
| Dataset Splits | Yes | We evaluated our multi-model speculative system on SpecBench (Xia et al., 2024), across multiple tasks including multi-turn conversation, translation, summarization, question answering, mathematical reasoning, and retrieval-augmented generation, employing MT-bench (Zheng et al., 2023), WMT14 DE-EN, CNN/Daily Mail (Nallapati et al., 2016), Natural Questions (Kwiatkowski et al., 2019), GSM8K (Cobbe et al., 2021), and DPR (Karpukhin et al., 2020). |
| Hardware Specification | Yes | Our experiments run on NVIDIA A800 80G GPUs. |
| Software Dependencies | No | The paper does not provide specific software dependencies with version numbers. |
| Experiment Setup | Yes | Speculative sampling (Leviathan et al., 2023) conducted experiments with a batch size of 1; similarly, the majority of our experiments also adopted this setting. For the intermediate model, we adopt 4-bit quantization (Ma et al., 2024) with a group size of 128, balancing reduced inference cost against quality. Draft models are built following EAGLE2, trained on ShareGPT data. |
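The pseudocode row refers to a three-model scheme (draft → intermediate → target) in which each stage verifies the previous stage's proposals while the final output distribution matches the target model exactly. Below is a minimal sketch of that per-token accept/reject chain on toy categorical distributions; the function names, the toy distributions, and the single-token framing are illustrative assumptions, not the paper's implementation (which drafts and verifies multi-token blocks on real LLMs).

```python
import random

def sample(dist):
    """Draw a token index from a categorical distribution (list of probs)."""
    r, acc = random.random(), 0.0
    for tok, p in enumerate(dist):
        acc += p
        if r < acc:
            return tok
    return len(dist) - 1  # guard against floating-point round-off

def speculative_accept(token, p_small, p_big):
    """Standard speculative-sampling step (Leviathan et al., 2023):
    accept the drafted token with prob min(1, p_big/p_small); otherwise
    resample from the normalized residual max(0, p_big - p_small).
    The returned token is distributed exactly according to p_big."""
    if random.random() < min(1.0, p_big[token] / p_small[token]):
        return token
    residual = [max(0.0, b - s) for s, b in zip(p_small, p_big)]
    z = sum(residual)
    return sample([r / z for r in residual])

def polybasic_step(chain):
    """One token through a chain of models ordered small -> large.
    Each adjacent pair runs an accept/reject step, so the output follows
    the last (target) model's distribution regardless of chain length."""
    tok = sample(chain[0])
    for p_small, p_big in zip(chain, chain[1:]):
        tok = speculative_accept(tok, p_small, p_big)
    return tok
```

Because each pairwise step is distribution-preserving, chaining them keeps the losslessness guarantee the paper claims; the speedup comes from the cheap models answering most steps, which this toy example does not measure.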