Mixture of Attentions For Speculative Decoding

Authors: Matthieu Zimmer, Milan Gritta, Gerasimos Lampouras, Haitham Bou Ammar, Jun Wang

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type | Experimental | We conduct extensive experiments to demonstrate the effectiveness of our approach. Compared to EAGLE-2, we show a 9.5% decoding speedup with a 25% higher acceptance rate in a single-device scenario, and an 84% speedup with a 53% higher acceptance rate in a client-server scenario.
Researcher Affiliation | Collaboration | Matthieu Zimmer, Milan Gritta & Gerasimos Lampouras (Huawei Noah's Ark Lab); Haitham Bou Ammar (Huawei Noah's Ark Lab, UCL Centre for Artificial Intelligence); Jun Wang (UCL Centre for Artificial Intelligence)
Pseudocode | Yes | A.3 ALGORITHM, Algorithm 1: generation algorithm for MSmall, assuming chain decoding.
Open Source Code | Yes | The source code is publicly available at https://github.com/huawei-noah/HEBO/tree/mixture-of-attentions/.
Open Datasets | Yes | We train all MSmall on the Ultrachat dataset (Ding et al., 2023) without a system prompt... We notably relied on the Spec-Bench benchmark (Xia et al., 2024) and the following datasets: MT-Bench (Zheng et al., 2023), HumanEval (Chen et al., 2021), GSM8K (Cobbe et al., 2021), Alpaca (Taori et al., 2023), CNN/Daily Mail (Nallapati et al., 2016) and Natural Questions (Kwiatkowski et al., 2019).
Dataset Splits | No | We train all MSmall on the Ultrachat dataset (Ding et al., 2023) without a system prompt and we do not assume that we know the system prompt at test time... Ultrachat is composed of around 200k prompts with around 240M tokens using the LLama3 tokenizer. We use multiple test datasets for generation including various tasks such as reasoning, code generation, multi-turn conversation and summarisation. While the paper mentions training on Ultrachat and testing on several public datasets, it does not provide specific numerical train/validation/test splits for their experiments on these datasets, nor does it detail how Ultrachat was split for their specific usage.
Hardware Specification | No | The server has 3 times more float16 tflops than the client. The devices are located in two different cities, separated by 300 km. The ping between the devices is around 9 ms and the bandwidth is 50 Mbits/sec. In 4G, we assume a maximum of 20 Mbits/sec with a normally distributed delay of 21 ms ± 19 ms and a 0.1% chance of dropping packets. In 5G, we assume a normally distributed delay of 10 ms ± 10 ms with a 0.1% chance of dropping packets. The paper describes relative computational power (tflops) and network conditions but does not specify exact GPU/CPU models or other hardware components used for the experiments.
Software Dependencies | No | We implemented our approach in vLLM (Kwon et al., 2023) without tree decoding to support higher batch sizes and continuous batching. The paper mentions using vLLM but does not provide specific version numbers for vLLM or any other software libraries/dependencies (e.g., Python, PyTorch, CUDA versions) used in their implementation.
Experiment Setup | Yes | A.1 HYPERPARAMETERS, Table 6 (list of hyperparameters for distillation):
  Learning rate for gradient descent: 3 × 10−5
  Total number of transformer updates: 186,000
  Minibatch size: 32
  Mixed-precision training: yes (float16)
  Weight of reverse KL loss (λ0): 0.1
  Weight of smooth L1 loss (λ1): 1.0
  L2 gradient clipping: 1.0
  T-step bounded mask for the CA layer: uniform between 5 and 15
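The Pseudocode row points to Algorithm 1, a generation loop for MSmall under chain decoding. The paper's own algorithm is not reproduced here, but generic chain speculative decoding can be sketched as below; `draft_next` and `target_next` are hypothetical stand-ins for the small and large models, and greedy token-match verification is used in place of the usual rejection sampling.

```python
def speculative_decode_chain(draft_next, target_next, prompt, k, max_new):
    """Chain (non-tree) speculative decoding sketch.

    draft_next/target_next: callables mapping a token list to the next token.
    k: number of tokens the draft model proposes per round.
    """
    out = list(prompt)
    while len(out) - len(prompt) < max_new:
        # 1) Draft model proposes k tokens autoregressively.
        ctx, proposed = list(out), []
        for _ in range(k):
            t = draft_next(ctx)
            proposed.append(t)
            ctx.append(t)
        # 2) Target model verifies the chain; accept the longest matching prefix.
        accepted = 0
        for i, t in enumerate(proposed):
            if target_next(out + proposed[:i]) == t:
                accepted += 1
            else:
                break
        out.extend(proposed[:accepted])
        # 3) The target model always contributes one token (the correction,
        #    or a bonus token when everything was accepted).
        out.append(target_next(out))
    return out[: len(prompt) + max_new]
```

With a perfect draft model every proposal is accepted and each round yields k + 1 tokens; with a poor one the loop degrades gracefully toward plain target decoding.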
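The Hardware Specification row characterises the client-server link only statistically (e.g. 4G: at most 20 Mbits/sec, a normally distributed delay of 21 ms ± 19 ms, 0.1% packet drop). A reproduction could simulate such a link with a sketch like the following; the function name and the retransmit-on-drop behaviour are assumptions, not details from the paper.

```python
import random

def simulated_transfer_s(payload_bytes, bandwidth_mbps, delay_mean_ms,
                         delay_std_ms, drop_prob, rng):
    """Simulated one-way transfer time in seconds, retrying dropped packets."""
    # Time to push the payload onto the wire at the given bandwidth.
    serialisation_s = payload_bytes * 8 / (bandwidth_mbps * 1e6)
    total = 0.0
    while True:
        # Latency sample clipped at zero (the ± figures admit negative draws).
        delay_s = max(0.0, rng.gauss(delay_mean_ms, delay_std_ms)) / 1000.0
        total += delay_s + serialisation_s
        if rng.random() >= drop_prob:  # packet got through
            return total
```

For the paper's 4G setting this would be called as `simulated_transfer_s(payload, 20, 21, 19, 0.001, rng)`, and for 5G with a 10 ms ± 10 ms delay.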
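The Experiment Setup row implies a distillation objective of the form λ0 · KL + λ1 · SmoothL1 with λ0 = 0.1 and λ1 = 1.0. Below is a minimal dependency-free sketch of such a combined loss, assuming the KL term is a reverse KL between student and teacher token distributions and the smooth L1 term compares feature vectors; the paper's exact loss inputs are not specified here, so all arguments are placeholders.

```python
import math

def smooth_l1(pred, target, beta=1.0):
    """Huber-style loss: quadratic for small errors, linear for large ones."""
    total = 0.0
    for p, t in zip(pred, target):
        d = abs(p - t)
        total += 0.5 * d * d / beta if d < beta else d - 0.5 * beta
    return total / len(pred)

def reverse_kl(student, teacher, eps=1e-12):
    """KL(student || teacher) for two discrete probability vectors."""
    return sum(s * math.log((s + eps) / (t + eps))
               for s, t in zip(student, teacher))

def distillation_loss(student_probs, teacher_probs,
                      student_feats, teacher_feats,
                      lam0=0.1, lam1=1.0):
    # Weighted sum matching the λ0 / λ1 entries of Table 6.
    return (lam0 * reverse_kl(student_probs, teacher_probs)
            + lam1 * smooth_l1(student_feats, teacher_feats))
```

In a real training loop this scalar would be backpropagated and gradients clipped to an L2 norm of 1.0, per the same table.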