Adaptive Draft-Verification for Efficient Large Language Model Decoding

Authors: Xukun Liu, Bowen Lei, Ruqi Zhang, Dongkuan (DK) Xu

AAAI 2025

Reproducibility assessment (Variable: Result, followed by the LLM response for each item):
Research Type: Experimental. "Through extensive experiments on various benchmark datasets and LLM architectures, we demonstrate that Adaptix accelerates the decoding process while maintaining high accuracy, making it suitable for deployment in a wide range of practical applications." The experiments show a 2.5X improvement in decoding speed compared to baselines.
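For context, the draft-then-verify decoding pattern that the paper accelerates can be sketched generically. The toy integer "tokens" and the draft/target functions below are illustrative stand-ins, not the Adaptix algorithm itself:

```python
# Minimal sketch of one generic draft-then-verify decoding round.
# A cheap draft model proposes k tokens; the target model checks them
# and keeps the longest agreeing prefix. Toy integer "tokens" stand in
# for real model outputs.

def draft_tokens(prefix, k):
    # Toy draft model: guesses that tokens simply increment by one.
    last = prefix[-1] if prefix else 0
    return [last + i + 1 for i in range(k)]

def target_next(prefix):
    # Toy target model: increments by one, but skips every 4th value.
    last = prefix[-1] if prefix else 0
    return last + 2 if last % 4 == 3 else last + 1

def draft_verify_step(prefix, k=4):
    """One round: returns (extended prefix, number of accepted draft tokens)."""
    cur = list(prefix)
    accepted = 0
    for tok in draft_tokens(prefix, k):
        if target_next(cur) == tok:      # draft agrees with target: accept
            cur.append(tok)
            accepted += 1
        else:                            # first mismatch: take target's token
            cur.append(target_next(cur))
            break
    else:                                # all drafts accepted: one bonus token
        cur.append(target_next(cur))
    return cur, accepted
```

With prefix [0] and k=4, the step accepts three draft tokens before the first disagreement, yielding [0, 1, 2, 3, 5]; the number of tokens accepted per round is the "accept length" quantity the paper's experiments track.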
Researcher Affiliation: Academia. 1 Northwestern University, 2 Texas A&M University, 3 Purdue University, 4 North Carolina State University. Author email addresses are redacted (EMAIL, EMAIL, EMAIL, EMAIL).
Pseudocode: No. The paper describes the methodology using text and equations (Eqs. 1–8) and provides a data-processing workflow diagram (Figure 2), but it does not contain a dedicated pseudocode or algorithm block.
Open Source Code: Yes. Code: https://github.com/liuxukun2000/Adaptix
Open Datasets: Yes. "Our assessment incorporates the HumanEval (Chen et al. 2021), MT-Bench (Zheng et al. 2023), and Alpaca (Taori et al. 2023) datasets to ascertain general natural language understanding and generation competencies." The first training corpus is built from a portion of the Python pre-training code in The Stack (Kocetkov et al. 2022), comprising about 2.7M Python code samples with a resulting size of 1007MB. The second is constructed from UltraChat (Ding et al. 2023), consisting of around 774K ChatGPT conversations and producing a corpus of 574MB.
Dataset Splits: No. The paper uses the HumanEval, MT-Bench, and Alpaca datasets for assessment but does not specify how they were split into training, validation, or test sets for its experiments, nor does it state that predefined standard splits were used.
Hardware Specification: Yes. "All experiments are conducted on NVIDIA A6000 GPUs, except for the 33B model, which utilizes an NVIDIA H100."
Software Dependencies: No. The paper does not list specific software dependencies with version numbers (e.g., Python, PyTorch, or CUDA versions).
Experiment Setup: Yes. The experiments default to greedy sampling. Figure 5 presents results for the Vicuna-7B model on the MT-Bench dataset, showing the impact of different MCTS search counts on performance. Figures 4b and 4c show that varying top-p and temperature has minimal impact on the average accept length.
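As background on the sampling knobs discussed above, greedy decoding and top-p (nucleus) filtering over a toy probability vector can be sketched as follows; the function names and example distribution are illustrative, not taken from the paper:

```python
def greedy(probs):
    # Greedy decoding: always pick the highest-probability token id.
    return max(range(len(probs)), key=lambda i: probs[i])

def top_p_filter(probs, p=0.9):
    # Nucleus (top-p) filtering: keep the smallest high-probability set
    # whose cumulative mass reaches p, then renormalize before sampling.
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    kept, total = [], 0.0
    for i in order:
        kept.append(i)
        total += probs[i]
        if total >= p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}
```

With probs = [0.5, 0.3, 0.15, 0.05], greedy selects token id 0, while top_p_filter(probs, 0.9) keeps ids 0, 1, and 2 and renormalizes their probabilities to sum to one; greedy decoding is the p → 0 limit of this scheme.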