Adaptive Draft-Verification for Efficient Large Language Model Decoding

Authors: Xukun Liu, Bowen Lei, Ruqi Zhang, Dongkuan (DK) Xu

AAAI 2025

Reproducibility assessment (Variable: Result, followed by the LLM response for each item):
Research Type: Experimental. "Through extensive experiments on various benchmark datasets and LLM architectures, we demonstrate that Adaptix accelerates the decoding process while maintaining high accuracy, making it suitable for deployment in a wide range of practical applications." The experiments show a 2.5X improvement in decoding speed compared to baselines.
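For context, the draft-then-verify decoding pattern that the paper accelerates can be sketched generically. The toy integer "tokens" and the draft/target functions below are illustrative stand-ins, not the Adaptix algorithm itself:

```python
# Minimal sketch of one generic draft-then-verify decoding round.
# A cheap draft model proposes k tokens; the target model checks them
# and keeps the longest agreeing prefix. Toy integer "tokens" stand in
# for real model outputs.

def draft_tokens(prefix, k):
    # Toy draft model: guesses that tokens simply increment by one.
    last = prefix[-1] if prefix else 0
    return [last + i + 1 for i in range(k)]

def target_next(prefix):
    # Toy target model: increments by one, but skips every 4th value.
    last = prefix[-1] if prefix else 0
    return last + 2 if last % 4 == 3 else last + 1

def draft_verify_step(prefix, k=4):
    """One round: returns (extended prefix, number of accepted draft tokens)."""
    cur = list(prefix)
    accepted = 0
    for tok in draft_tokens(prefix, k):
        if target_next(cur) == tok:      # draft agrees with target: accept
            cur.append(tok)
            accepted += 1
        else:                            # first mismatch: take target's token
            cur.append(target_next(cur))
            break
    else:                                # all drafts accepted: one bonus token
        cur.append(target_next(cur))
    return cur, accepted
```

With prefix [0] and k=4, the step accepts three draft tokens before the first disagreement, yielding [0, 1, 2, 3, 5]; the number of tokens accepted per round is the "accept length" quantity the paper's experiments track.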
Researcher Affiliation: Academia. 1 Northwestern University, 2 Texas A&M University, 3 Purdue University, 4 North Carolina State University. Author email addresses are redacted (EMAIL, EMAIL, EMAIL, EMAIL).
Pseudocode: No. The paper describes the methodology using text and equations (Eqs. 1–8) and provides a data-processing workflow diagram (Figure 2), but it does not contain a dedicated pseudocode or algorithm block.
Open Source Code: Yes. Code: https://github.com/liuxukun2000/Adaptix
Open Datasets: Yes. "Our assessment incorporates the HumanEval (Chen et al. 2021), MT-Bench (Zheng et al. 2023), and Alpaca (Taori et al. 2023) datasets to ascertain general natural language understanding and generation competencies." The first training corpus is built from a portion of the Python pre-training code in The Stack (Kocetkov et al. 2022), comprising about 2.7M Python code samples with a resulting size of 1007MB. The second is constructed from UltraChat (Ding et al. 2023), consisting of around 774K ChatGPT conversations and producing a corpus of 574MB.
Dataset Splits: No. The paper uses the HumanEval, MT-Bench, and Alpaca datasets for assessment but does not specify how they were split into training, validation, or test sets for its experiments, nor does it state that predefined standard splits were used.
Hardware Specification: Yes. "All experiments are conducted on NVIDIA A6000 GPUs, except for the 33B model, which utilizes an NVIDIA H100."
Software Dependencies: No. The paper does not list specific software dependencies with version numbers (e.g., Python, PyTorch, or CUDA versions).
Experiment Setup: Yes. The experiments default to greedy sampling. Figure 5 presents results for the Vicuna-7B model on the MT-Bench dataset, showing the impact of different MCTS search counts on performance. Figures 4b and 4c show that varying top-p and temperature has minimal impact on the average accept length.
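As background on the sampling knobs discussed above, greedy decoding and top-p (nucleus) filtering over a toy probability vector can be sketched as follows; the function names and example distribution are illustrative, not taken from the paper:

```python
def greedy(probs):
    # Greedy decoding: always pick the highest-probability token id.
    return max(range(len(probs)), key=lambda i: probs[i])

def top_p_filter(probs, p=0.9):
    # Nucleus (top-p) filtering: keep the smallest high-probability set
    # whose cumulative mass reaches p, then renormalize before sampling.
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    kept, total = [], 0.0
    for i in order:
        kept.append(i)
        total += probs[i]
        if total >= p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}
```

With probs = [0.5, 0.3, 0.15, 0.05], greedy selects token id 0, while top_p_filter(probs, 0.9) keeps ids 0, 1, and 2 and renormalizes their probabilities to sum to one; greedy decoding is the p → 0 limit of this scheme.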