Gumiho: A Hybrid Architecture to Prioritize Early Tokens in Speculative Decoding

Authors: Jinze Li, Yixing Xu, Haiduo Huang, Xuanwu Yin, Dong Li, Edith C. H. Ngai, Emad Barsoum

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "The experimental results demonstrate that our method outperforms existing approaches, fully validating its effectiveness. Our code is available at https://github.com/AMD-AIG-AIMA/Gumiho. In this section, we compare our proposed Gumiho with other SOTA methods to show the priority of our approach. Then, we conduct several ablation studies to validate the effectiveness of each part of our method."
Researcher Affiliation | Collaboration | "1Advanced Micro Devices, Inc., Beijing, China 2Department of Electrical and Electronic Engineering, The University of Hong Kong 3Institute of Artificial Intelligence and Robotics, Xi'an Jiaotong University. Correspondence to: Edith C.H. Ngai <EMAIL>."
Pseudocode | No | The paper describes methods and theoretical analysis using mathematical equations and text, but does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | "Our code is available at https://github.com/AMD-AIG-AIMA/Gumiho."
Open Datasets | Yes | "Following Eagle and Eagle-2 (Li et al., 2024a;b), we train our draft model on the ShareGPT dataset. Our Gumiho model comprises a Transformer model and five MLPs to predict the next seven draft tokens: the Transformer autoregressively generates the first two tokens, and the remaining five are predicted in parallel by the MLPs. Training details and hyperparameters can be found in Appendix C. We evaluate the performance across multiple benchmarks: MT-Bench (Zheng et al., 2023) for multi-turn dialogue, HumanEval (Chen et al., 2021) for code generation, GSM8K (Cobbe et al., 2021) for mathematical reasoning, Alpaca (Taori et al., 2023) for general instruction-following, CNN/Daily Mail (Nallapati et al., 2016) for summarization, and Natural Questions (Kwiatkowski et al., 2019) for question answering."
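The hybrid drafting scheme quoted above (a serial Transformer drafts the first two tokens, then five MLP heads predict the remaining five in parallel) can be sketched as a toy, assuming random stand-in weights. All dimensions, weight matrices, and function names here are illustrative, not the paper's code:

```python
import numpy as np

rng = np.random.default_rng(0)
HIDDEN, VOCAB = 16, 50

# Stand-ins for learned weights (random here; the real model trains these).
W_serial = rng.normal(size=(HIDDEN, HIDDEN)) / np.sqrt(HIDDEN)  # toy stand-in for the 2-layer Transformer
W_vocab = rng.normal(size=(HIDDEN, VOCAB)) / np.sqrt(HIDDEN)    # shared LM head
W_mlps = rng.normal(size=(5, HIDDEN, VOCAB)) / np.sqrt(HIDDEN)  # 5 parallel MLP heads

def draft_seven_tokens(h):
    """Draft 7 tokens: the first 2 serially (accurate, autoregressive path),
    the last 5 in a single parallel shot (cheap path)."""
    draft = []
    for _ in range(2):                       # serial phase: 2 autoregressive steps
        h = np.tanh(h @ W_serial)            # update hidden state
        draft.append(int(np.argmax(h @ W_vocab)))
    for k in range(5):                       # parallel phase: 5 independent MLP heads
        draft.append(int(np.argmax(h @ W_mlps[k])))
    return draft

tokens = draft_seven_tokens(rng.normal(size=HIDDEN))
```

The design point being illustrated: only the first two tokens pay the sequential Transformer cost, while the remaining five are produced from the same hidden state at once, trading some accuracy on late draft positions for speed.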
Dataset Splits | No | The paper mentions several datasets for training and evaluation (ShareGPT, MT-Bench, HumanEval, GSM8K, Alpaca, CNN/Daily Mail, Natural Questions) but does not specify the training/test/validation splits used for these datasets within the text.
Hardware Specification | Yes | "We conduct model training using 8 AMD Instinct MI250 GPUs. For evaluation, we use a single MI250 GPU for all models except the 70B variant, which requires 4 MI250 GPUs due to its larger size. Additionally, we include evaluation results using a single NVIDIA A100 GPU in Appendix B."
Software Dependencies | No | The paper mentions using the AdamW optimizer but does not specify versions for other key software components like programming languages, libraries, or frameworks.
Experiment Setup | Yes | "Training details and hyperparameters can be found in Appendix C. Table 5: Hyper-parameter configurations of Gumiho. Learning rate 2e-4 (Vicuna 7B/13B) / 1e-4 (LLaMA2 7B/13B/70B, LLaMA3 70B), Transformer layer number 2, MLP head number 5, Batch size 4, Training epoch 10, Optimizer AdamW, (β1, β2) (0.9, 0.95)."
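For quick reference, the Table 5 hyper-parameters quoted above can be collected into a single config dict. The field names are mine; the values come directly from the quoted table:

```python
# Hyper-parameters reported in Table 5 of the Gumiho paper.
# Key names are illustrative; values are taken verbatim from the table.
GUMIHO_HPARAMS = {
    "learning_rate": {
        "vicuna_7b_13b": 2e-4,
        "llama2_7b_13b_70b_llama3_70b": 1e-4,
    },
    "transformer_layers": 2,   # serial Transformer depth
    "mlp_heads": 5,            # parallel draft heads
    "batch_size": 4,
    "training_epochs": 10,
    "optimizer": "AdamW",
    "betas": (0.9, 0.95),      # (beta1, beta2) for AdamW
}
```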