Gumiho: A Hybrid Architecture to Prioritize Early Tokens in Speculative Decoding
Authors: Jinze Li, Yixing Xu, Haiduo Huang, Xuanwu Yin, Dong Li, Edith C. H. Ngai, Emad Barsoum
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The experimental results demonstrate that our method outperforms existing approaches, fully validating its effectiveness. Our code is available at https://github.com/AMD-AIG-AIMA/Gumiho. In this section, we compare our proposed Gumiho with other SOTA methods to show the superiority of our approach. Then, we conduct several ablation studies to validate the effectiveness of each part of our method. |
| Researcher Affiliation | Collaboration | 1Advanced Micro Devices, Inc., Beijing, China 2Department of Electrical and Electronic Engineering, The University of Hong Kong 3Institute of Artificial Intelligence and Robotics, Xi'an Jiaotong University. Correspondence to: Edith C.H. Ngai <EMAIL>. |
| Pseudocode | No | The paper describes methods and theoretical analysis using mathematical equations and text, but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code is available at https://github.com/AMD-AIG-AIMA/Gumiho. |
| Open Datasets | Yes | Following Eagle and Eagle-2 (Li et al., 2024a;b), we train our draft model on the ShareGPT dataset. Our Gumiho model comprises a Transformer model and five MLPs to predict the next seven draft tokens: the Transformer autoregressively generates the first two tokens, and the remaining five are predicted in parallel by the MLPs. Training details and hyperparameters can be found in Appendix C. We evaluate the performance across multiple benchmarks: MT-Bench (Zheng et al., 2023) for multi-turn dialogue, HumanEval (Chen et al., 2021) for code generation, GSM8K (Cobbe et al., 2021) for mathematical reasoning, Alpaca (Taori et al., 2023) for general instruction-following, CNN/Daily Mail (Nallapati et al., 2016) for summarization, and Natural Questions (Kwiatkowski et al., 2019) for question answering. |
| Dataset Splits | No | The paper names the datasets used for training and evaluation (ShareGPT, MT-Bench, HumanEval, GSM8K, Alpaca, CNN/Daily Mail, Natural Questions) but does not specify the training/validation/test splits used for any of them. |
| Hardware Specification | Yes | We conduct model training using 8 AMD Instinct MI250 GPUs. For evaluation, we use a single MI250 GPU for all models except the 70B variant, which requires 4 MI250 GPUs due to its larger size. Additionally, we include evaluation results using a single NVIDIA A100 GPU in Appendix B. |
| Software Dependencies | No | The paper mentions using the AdamW optimizer but does not specify versions for key software components such as the programming language, libraries, or frameworks. |
| Experiment Setup | Yes | Training details and hyperparameters can be found in Appendix C. Table 5: Hyper-parameter configurations of Gumiho. Learning rate 2e-4 (Vicuna 7B/13B) / 1e-4 (LLaMA2 7B/13B/70B, LLaMA3 70B), Transformer layer number 2, MLP head number 5, Batch size 4, Training epochs 10, Optimizer AdamW, (β1, β2) = (0.9, 0.95). |
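The hybrid draft schedule described in the excerpts above (a small Transformer producing the first two draft tokens serially, then five MLP heads predicting the remaining five in parallel from the same hidden state) can be sketched as below. This is a minimal illustrative mock-up, not the paper's actual modules: `transformer_step` stands in for the 2-layer Transformer, and each head is reduced to a single linear map; all dimensions and names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
HIDDEN, VOCAB = 64, 1000  # illustrative sizes, not the paper's

# Stand-in weights: one hidden-to-hidden map for the serial Transformer path,
# an output projection, and 5 independent "MLP head" projections.
W_t = rng.standard_normal((HIDDEN, HIDDEN)) / np.sqrt(HIDDEN)
W_out = rng.standard_normal((HIDDEN, VOCAB)) / np.sqrt(HIDDEN)
mlp_heads = [rng.standard_normal((HIDDEN, VOCAB)) / np.sqrt(HIDDEN)
             for _ in range(5)]

def transformer_step(h):
    # Stand-in for one autoregressive step of the draft Transformer.
    return np.tanh(h @ W_t)

def draft_tokens(h):
    """Return 7 draft token ids following Gumiho's hybrid schedule:
    2 serial (Transformer) tokens + 5 parallel (MLP-head) tokens."""
    tokens = []
    # Serial part: the early, accuracy-critical tokens are generated
    # autoregressively, one Transformer step each.
    for _ in range(2):
        h = transformer_step(h)
        tokens.append(int(np.argmax(h @ W_out)))
    # Parallel part: 5 heads each predict one later token from the
    # same final hidden state, with no dependency between them.
    tokens.extend(int(np.argmax(h @ W)) for W in mlp_heads)
    return tokens

h0 = rng.standard_normal(HIDDEN)
print(draft_tokens(h0))  # 7 draft token ids
```

The serial/parallel split is the point of the design: the first tokens, which gate acceptance of the whole draft, get the more expensive autoregressive path, while the cheaper parallel heads fill out the tail of the 7-token draft.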