Gumiho: A Hybrid Architecture to Prioritize Early Tokens in Speculative Decoding

Authors: Jinze Li, Yixing Xu, Haiduo Huang, Xuanwu Yin, Dong Li, Edith C. H. Ngai, Emad Barsoum

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "The experimental results demonstrate that our method outperforms existing approaches, fully validating its effectiveness. Our code is available at https://github.com/AMD-AIG-AIMA/Gumiho. In this section, we compare our proposed Gumiho with other SOTA methods to show the priority of our approach. Then, we conduct several ablation studies to validate the effectiveness of each part of our method."
Researcher Affiliation | Collaboration | "1Advanced Micro Devices, Inc., Beijing, China 2Department of Electrical and Electronic Engineering, The University of Hong Kong 3Institute of Artificial Intelligence and Robotics, Xi'an Jiaotong University. Correspondence to: Edith C.H. Ngai <EMAIL>."
Pseudocode | No | The paper describes methods and theoretical analysis using mathematical equations and text, but does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | "Our code is available at https://github.com/AMD-AIG-AIMA/Gumiho."
Open Datasets | Yes | "Following Eagle and Eagle-2 (Li et al., 2024a;b), we train our draft model on the ShareGPT dataset. Our Gumiho model comprises a Transformer model and five MLPs to predict the next seven draft tokens: the Transformer autoregressively generates the first two tokens, and the remaining five are predicted in parallel by the MLPs. Training details and hyperparameters can be found in Appendix C. We evaluate the performance across multiple benchmarks: MT-Bench (Zheng et al., 2023) for multi-turn dialogue, HumanEval (Chen et al., 2021) for code generation, GSM8K (Cobbe et al., 2021) for mathematical reasoning, Alpaca (Taori et al., 2023) for general instruction-following, CNN/Daily Mail (Nallapati et al., 2016) for summarization, and Natural Questions (Kwiatkowski et al., 2019) for question answering."
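The hybrid drafting scheme quoted above (a serial Transformer drafts the first two tokens, then five MLP heads predict the remaining five in parallel) can be sketched as a toy, assuming random stand-in weights. All dimensions, weight matrices, and function names here are illustrative, not the paper's code:

```python
import numpy as np

rng = np.random.default_rng(0)
HIDDEN, VOCAB = 16, 50

# Stand-ins for learned weights (random here; the real model trains these).
W_serial = rng.normal(size=(HIDDEN, HIDDEN)) / np.sqrt(HIDDEN)  # toy stand-in for the 2-layer Transformer
W_vocab = rng.normal(size=(HIDDEN, VOCAB)) / np.sqrt(HIDDEN)    # shared LM head
W_mlps = rng.normal(size=(5, HIDDEN, VOCAB)) / np.sqrt(HIDDEN)  # 5 parallel MLP heads

def draft_seven_tokens(h):
    """Draft 7 tokens: the first 2 serially (accurate, autoregressive path),
    the last 5 in a single parallel shot (cheap path)."""
    draft = []
    for _ in range(2):                       # serial phase: 2 autoregressive steps
        h = np.tanh(h @ W_serial)            # update hidden state
        draft.append(int(np.argmax(h @ W_vocab)))
    for k in range(5):                       # parallel phase: 5 independent MLP heads
        draft.append(int(np.argmax(h @ W_mlps[k])))
    return draft

tokens = draft_seven_tokens(rng.normal(size=HIDDEN))
```

The design point being illustrated: only the first two tokens pay the sequential Transformer cost, while the remaining five are produced from the same hidden state at once, trading some accuracy on late draft positions for speed.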
Dataset Splits | No | The paper mentions several datasets for training and evaluation (ShareGPT, MT-Bench, HumanEval, GSM8K, Alpaca, CNN/Daily Mail, Natural Questions) but does not specify the training/test/validation splits used for these datasets within the text.
Hardware Specification | Yes | "We conduct model training using 8 AMD Instinct MI250 GPUs. For evaluation, we use a single MI250 GPU for all models except the 70B variant, which requires 4 MI250 GPUs due to its larger size. Additionally, we include evaluation results using a single NVIDIA A100 GPU in Appendix B."
Software Dependencies | No | The paper mentions using the AdamW optimizer but does not specify versions for other key software components like programming languages, libraries, or frameworks.
Experiment Setup | Yes | "Training details and hyperparameters can be found in Appendix C. Table 5: Hyper-parameter configurations of Gumiho. Learning rate 2e-4 (Vicuna 7B/13B) / 1e-4 (LLaMA2 7B/13B/70B, LLaMA3 70B), Transformer layer number 2, MLP head number 5, Batch size 4, Training epoch 10, Optimizer AdamW, (β1, β2) (0.9, 0.95)."
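For quick reference, the Table 5 hyper-parameters quoted above can be collected into a single config dict. The field names are mine; the values come directly from the quoted table:

```python
# Hyper-parameters reported in Table 5 of the Gumiho paper.
# Key names are illustrative; values are taken verbatim from the table.
GUMIHO_HPARAMS = {
    "learning_rate": {
        "vicuna_7b_13b": 2e-4,
        "llama2_7b_13b_70b_llama3_70b": 1e-4,
    },
    "transformer_layers": 2,   # serial Transformer depth
    "mlp_heads": 5,            # parallel draft heads
    "batch_size": 4,
    "training_epochs": 10,
    "optimizer": "AdamW",
    "betas": (0.9, 0.95),      # (beta1, beta2) for AdamW
}
```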