Kangaroo: Lossless Self-Speculative Decoding for Accelerating LLMs via Double Early Exiting

Authors: Fangcheng Liu, Yehui Tang, Zhenhua Liu, Yunsheng Ni, Duyu Tang, Kai Han, Yunhe Wang

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on multiple benchmarks demonstrate the effectiveness of Kangaroo, where it achieves walltime speedups up to 2.04×, outperforming Medusa-1 with 88.7% fewer additional parameters.
Researcher Affiliation | Industry | Huawei Noah's Ark Lab; Consumer Business Group, Huawei
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | The code for Kangaroo is available at https://github.com/Equationliu/Kangaroo.
Open Datasets | Yes | We conduct experiments on Vicuna [12] models with sizes of 7B and 13B. ... For Kangaroo, we train the adapter A for 10 epochs with the AdamW [42] optimizer on the ShareGPT dataset following Medusa [20].
Dataset Splits | Yes | We evaluate the acceleration performance with the recently proposed Spec-Bench [22], which consists of six subtasks: Multi-turn Conversation, Translation, Summarization, Question Answering, Mathematical Reasoning, and Retrieval-augmented Generation.
Hardware Specification | Yes | The training of the adapter A for Vicuna-7B takes around 24 hours on 8 NVIDIA V100 GPUs.
Software Dependencies | No | The paper mentions using the AdamW [42] optimizer but does not provide version numbers for other software dependencies such as programming languages or libraries.
Experiment Setup | Yes | For Kangaroo, we train the adapter A for 10 epochs with the AdamW [42] optimizer on the ShareGPT dataset following Medusa [20]. ... During the inference stage, we set ℓ = 2 for Vicuna-7B and ℓ = 3 for Vicuna-13B. For single-sequence decoding in Kangaroo, we set γ = 6 and η = 0.6. For the dynamic tree decoding scenario, we set Top-K to 10 and η = 0.4. (See the illustrative sketches after this table.)
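The quoted training setup (adapter A, 10 epochs, AdamW, ShareGPT) can be sketched roughly as follows. The `Adapter` module, the learning rate, and the distillation-style objective below are assumptions introduced purely for illustration; the paper's exact adapter architecture and loss are not quoted in this report, so consult https://github.com/Equationliu/Kangaroo for the authors' actual implementation.

```python
# Hedged sketch of the quoted adapter-training setup: AdamW for 10 epochs on
# ShareGPT. Architecture and objective are assumptions, not the paper's spec.
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Hypothetical stand-in for Kangaroo's adapter A, which bridges the
    hidden states of the first ℓ layers to the frozen LM head."""
    def __init__(self, hidden_size: int, num_heads: int = 32):
        super().__init__()
        self.attn = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(hidden_size)

    def forward(self, h):
        out, _ = self.attn(h, h, h)
        return self.norm(h + out)

def train_adapter(adapter, shallow_hidden_states, target_logits, lm_head, epochs=10):
    # Only the adapter is trained; the base model and its LM head stay frozen,
    # which is why Kangaroo needs far fewer additional parameters than Medusa-1.
    opt = torch.optim.AdamW(adapter.parameters(), lr=2e-4)  # lr is an assumed value
    loss_fn = nn.KLDivLoss(reduction="batchmean")
    for _ in range(epochs):
        for h, t in zip(shallow_hidden_states, target_logits):
            draft_logits = lm_head(adapter(h))
            # Assumed distillation-style objective: match the full model's
            # output distribution (the paper's exact loss may differ).
            loss = loss_fn(draft_logits.log_softmax(-1), t.softmax(-1))
            opt.zero_grad()
            loss.backward()
            opt.step()
```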
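The quoted single-sequence inference hyperparameters (γ = 6 draft steps, confidence threshold η = 0.6) are easiest to read in code. Below is a minimal sketch of one draft-then-verify cycle under greedy decoding; `kangaroo_step`, `shallow_forward`, and `full_forward` are hypothetical names for this illustration, and the released implementation differs in important details (KV caching, dynamic tree decoding).

```python
# Minimal sketch of Kangaroo-style single-sequence self-speculative decoding.
import torch

GAMMA = 6  # max draft tokens per cycle (paper: γ = 6 for single-sequence decoding)
ETA = 0.6  # confidence threshold for exiting the drafting phase (paper: η = 0.6)

@torch.no_grad()
def kangaroo_step(tokens, shallow_forward, full_forward):
    """One draft-then-verify cycle. `tokens` is a list of token ids; both
    forward functions return per-position logits of shape [seq_len, vocab]."""
    # Drafting: the shallow sub-network (first ℓ layers + adapter + shared LM
    # head) proposes tokens until its confidence drops below η or γ is reached.
    draft = []
    for _ in range(GAMMA):
        logits = shallow_forward(tokens + draft)
        conf, tok = torch.softmax(logits[-1], dim=-1).max(dim=-1)
        if conf.item() < ETA:  # early exit: stop drafting when the draft is unsure
            break
        draft.append(int(tok))

    # Verification: one full-model pass scores all draft tokens at once; accept
    # the longest prefix that agrees with the full model's greedy choices.
    logits = full_forward(tokens + draft)
    accepted = []
    for i, tok in enumerate(draft):
        target = int(logits[len(tokens) + i - 1].argmax(dim=-1))
        if tok != target:
            accepted.append(target)  # correct the first mismatch, then stop
            break
        accepted.append(tok)
    else:
        # Every draft token was accepted: take one bonus token from the full model.
        accepted.append(int(logits[-1].argmax(dim=-1)))
    return tokens + accepted
```

Because verification replays the draft through the unmodified target model and keeps only tokens that model itself would emit, the procedure is lossless under greedy decoding; the speedup comes from scoring several draft tokens in a single full-model pass.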