Multi-Branch Self-Drafting for LLM Inference Acceleration
Authors: Zipeng Gao, Qingrong Xia, Tong Xu, Xinyu Duan, Zhi Zheng, Zhefeng Wang, Enhong Chen
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments across various open-source benchmarks show that our method generates 2.0 to 3.2 tokens per forward step and achieves around a 2x improvement in end-to-end throughput compared to the autoregressive decoding strategy. |
| Researcher Affiliation | Collaboration | (1) School of Computer Science and Technology, University of Science and Technology of China; (2) Huawei Cloud |
| Pseudocode | Yes | Algorithm 1: Self-Draft Algorithm |
| Open Source Code | Yes | Code: https://github.com/ZipECHO/Self-Draft |
| Open Datasets | Yes | Evaluation benchmarks. We evaluated the Self-Draft decoding strategy on different benchmarks. MT-Bench (Zheng et al. 2023) is a diverse set of multi-turn dialogues covering 8 kinds of tasks with 10 questions per task. GSM-8K (Cobbe et al. 2021) is a dataset of high-quality grade-school math problems; we randomly sampled 100 questions from its test set to construct GSM-100. For code completion tasks, we applied the whole HumanEval (Chen et al. 2021) dataset and randomly sampled 100 questions from the test set of MBPP (Austin et al. 2021) to construct MBPP-100. Metric. We applied two metrics to evaluate the decoding strategies: throughput (TP), the average number of tokens generated per second, and decoding efficiency (DE), the average number of tokens generated per forward step. All of our experiments were conducted on an NVIDIA A100 GPU with 40GB of memory, using mixed precision (FP16) to enhance computational efficiency. Throughout, inference was carried out with a batch size of one. |
| Dataset Splits | Yes | GSM-8K (Cobbe et al. 2021) is a dataset of high-quality grade-school math problems; we randomly sampled 100 questions from its test set to construct GSM-100. For code completion tasks, we applied the whole HumanEval (Chen et al. 2021) dataset and randomly sampled 100 questions from the test set of MBPP (Austin et al. 2021) to construct MBPP-100. |
| Hardware Specification | Yes | All of our experiments were conducted on an NVIDIA A100 GPU with 40GB of memory, using mixed precision (FP16) to enhance computational efficiency. |
| Software Dependencies | No | The paper mentions the 'Transformers library' but does not specify a version number. No other specific software versions are provided. |
| Experiment Setup | Yes | For speculative decoding, we employed the assisted decoding strategy available in the Transformers library with Llama-68M (Miao et al. 2024) as the draft model and all parameters set to default. For the LADE method, we adopted the configuration described in their paper. For Self-Draft, we set both the draft length and the number of draft branches to 6 across all models; we discuss the selection of the draft length and the number of draft branches in the next section. The experimental results show that our method outperforms basic speculative decoding and the LADE method across various models and datasets. On the MT-Bench dataset, our approach achieved more than 56% and 45% speed-up for the Llama-7B and 13B models, respectively. In the mathematical problem-solving task on the GSM-100 dataset, our approach achieved 80% and 71% decoding throughput improvements for the Llama-7B and 13B models, respectively. In the code completion tasks, across both datasets and both models, our method achieved over 2x improvements in throughput across the board. Regarding decoding efficiency, our method generally achieves more than 2 tokens per forward step across all evaluation datasets and models, and reaches up to 3.22 tokens per forward step on HumanEval; accordingly, we established a global gram length of 4. |
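The two metrics quoted in the table, throughput (TP) and decoding efficiency (DE), are simple ratios over a decoding trace. A minimal sketch of how a reproduction could compute them; the function names are illustrative and not taken from the paper:

```python
def throughput(num_tokens: int, elapsed_seconds: float) -> float:
    """Throughput (TP): average number of tokens generated per second."""
    return num_tokens / elapsed_seconds

def decoding_efficiency(num_tokens: int, num_forward_steps: int) -> float:
    """Decoding efficiency (DE): average tokens accepted per forward step."""
    return num_tokens / num_forward_steps

# Example trace: 320 tokens generated in 4.0 s over 100 forward steps.
tp = throughput(320, 4.0)            # 80.0 tokens/s
de = decoding_efficiency(320, 100)   # 3.2 tokens per forward step
```

A DE of 3.2 here matches the upper end of the 2.0 to 3.2 tokens-per-step range reported in the paper's abstract.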
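As background for the Experiment Setup row: the compared strategies (speculative decoding, LADE, Self-Draft) all follow a draft-then-verify pattern, where candidate tokens are checked against the target model and the longest matching prefix is accepted. A minimal greedy-verification sketch under that assumption; `target_next` is a hypothetical stand-in for one forward pass of the target model, and real implementations verify all draft tokens in a single batched forward pass rather than a Python loop:

```python
def verify_draft(prefix, draft, target_next):
    """Greedy verification: accept the longest draft prefix that matches
    what the target model would emit, plus one token from the target."""
    accepted = []
    ctx = list(prefix)
    for tok in draft:
        expected = target_next(ctx)
        if tok == expected:
            accepted.append(tok)   # draft token confirmed by the target
            ctx.append(tok)
        else:
            accepted.append(expected)  # mismatch: take the target's token
            break
    else:
        accepted.append(target_next(ctx))  # whole draft accepted; add one more
    return accepted

# Toy "target model": always predicts the integer after the last token.
target = lambda ctx: ctx[-1] + 1
print(verify_draft([1, 2], [3, 4, 9], target))  # [3, 4, 5]
```

Each call to `verify_draft` corresponds to one forward step; the number of tokens it returns is what DE averages over the whole generation.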