AdaDecode: Accelerating LLM Decoding with Adaptive Layer Parallelism

Authors: Zhepei Wei, Wei-Lin Chen, Xinyu Zhu, Yu Meng

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments across diverse generation tasks show that AdaDecode consistently achieves superior decoding throughput compared to baselines, with up to 1.73× speedup, while guaranteeing output parity with standard autoregressive decoding.
Researcher Affiliation | Academia | University of Virginia. Correspondence to: Zhepei Wei <EMAIL>, Wei-Lin Chen <EMAIL>, Xinyu Zhu <EMAIL>, Yu Meng <EMAIL>.
Pseudocode | Yes | Algorithm 1: LLM Decoding Acceleration via Adaptive Layer Parallelism (AdaDecode)
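The core decision in Algorithm 1 can be illustrated with a minimal, hypothetical sketch (this is a simplification, not the paper's implementation): an intermediate layer commits a token early only when its lightweight LM head's top probability clears a confidence threshold, and the number of still-unverified early predictions is capped. The values γ = 0.75 and a cap of 5 come from the paper's experiment setup; the function name and signature are illustrative.

```python
# Hypothetical simplification of confidence-gated early prediction in
# adaptive layer parallelism. GAMMA and MAX_EARLY_PREDICTIONS follow the
# paper's reported settings; everything else is an assumption.
GAMMA = 0.75              # confidence threshold from Equation (1)
MAX_EARLY_PREDICTIONS = 5 # cap on unverified early predictions

def maybe_predict_early(top_prob: float, num_pending: int) -> bool:
    """Decide whether an intermediate layer may emit a token early."""
    return top_prob >= GAMMA and num_pending < MAX_EARLY_PREDICTIONS
```

Tokens emitted early this way are later verified against the full model, which is how output parity with standard autoregressive decoding is preserved.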
Open Source Code | Yes | Code and artifacts are available at https://github.com/weizhepei/AdaDecode.
Open Datasets | Yes | We evaluate our method on a diverse set of text generation tasks, including text summarization (i.e., XSum (Narayan et al., 2018)), code generation (i.e., HumanEval (Chen et al., 2021)), and mathematical reasoning (i.e., GSM8K (Cobbe et al., 2021)), covering a broad spectrum of language model capabilities.
Dataset Splits | Yes | For text summarization, we use the widely adopted extreme summarization (XSum) dataset (Narayan et al., 2018), where the models are prompted to produce a single-sentence summary of a news article, testing their ability to identify and precisely summarize the most salient information in a coherent sentence. Following previous work (Zhang et al., 2024a), we randomly sample 1K instances from the test split for evaluation and 10K instances from the training split for training the lightweight LM head. For code generation, we evaluate our method on the HumanEval (Chen et al., 2021) benchmark, which assesses Python programming skills through a variety of coding problems, ranging from basic tasks to complex problem-solving challenges. Since the standard HumanEval benchmark does not provide a training set, we use the entire MBPP (Austin et al., 2021) dataset for training; it contains crowd-sourced Python programming problems designed to be solvable by entry-level programmers, covering programming fundamentals and standard library functionality. This results in a total of 974 training samples and 164 test samples for this task. For mathematical reasoning, we use GSM8K (Cobbe et al., 2021), which contains diverse grade-school math word problems created by human problem writers. The dataset consists of 7.5K training problems and 1K test problems.
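The random subsampling described for XSum (1K test instances for evaluation, 10K train instances for fitting the LM head) can be sketched as follows. This is a toy illustration with placeholder data, not the paper's data pipeline; in practice the splits would be loaded from the actual XSum, HumanEval/MBPP, and GSM8K releases.

```python
import random

def sample_split(split, n, seed=0):
    """Randomly draw n examples without replacement from a dataset split."""
    rng = random.Random(seed)  # fixed seed for a reproducible subsample
    return rng.sample(split, n)

# Toy stand-ins for the real XSum splits (sizes here are arbitrary).
xsum_test = [f"article-{i}" for i in range(5_000)]
xsum_train = [f"article-{i}" for i in range(50_000)]

eval_set = sample_split(xsum_test, 1_000)      # 1K test instances for evaluation
head_train = sample_split(xsum_train, 10_000)  # 10K train instances for the LM head
```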
Hardware Specification | Yes | The lightweight LM heads in our method are trained through full-parameter fine-tuning using the alignment-handbook repository with 8 NVIDIA A100 GPUs.
Software Dependencies | No | The framework is implemented using the Hugging Face Transformers library, and we set the sampling temperature to zero in all methods for a reproducible comparison. Following Zhang et al. (2024a), the maximum number of new tokens is set to 512.
Experiment Setup | Yes | By default, our models are trained using the Adam optimizer (Kingma & Ba, 2014) for 100 epochs, with a batch size of 128, a learning rate of 5e-3, and a cosine learning rate schedule with 3% warmup steps. The framework is implemented using the Hugging Face Transformers library, and we set the sampling temperature to zero in all methods for a reproducible comparison. Following Zhang et al. (2024a), the maximum number of new tokens is set to 512. The threshold γ in Equation (1) is set to 0.75, and to ensure a balance between the early prediction rate and the rejection rate, we limit the maximum number of early predictions to 5.
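The quoted schedule (cosine decay with 3% warmup) can be written out as a minimal sketch. Only the hyperparameters (100 epochs, batch 128, base learning rate 5e-3, 3% warmup) come from the text; the total step count is a placeholder, since it depends on the unstated dataset size, and the linear-warmup shape is an assumption based on common practice.

```python
import math

# Hyperparameters from the reported setup; TOTAL_STEPS is a placeholder.
EPOCHS, BATCH_SIZE, BASE_LR, WARMUP_FRAC = 100, 128, 5e-3, 0.03
TOTAL_STEPS = 10_000
WARMUP_STEPS = int(WARMUP_FRAC * TOTAL_STEPS)  # 3% warmup -> 300 steps

def cosine_lr(step: int) -> float:
    """Learning rate at a given step: linear warmup, then cosine decay to 0."""
    if step < WARMUP_STEPS:
        return BASE_LR * step / max(1, WARMUP_STEPS)
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    return BASE_LR * 0.5 * (1.0 + math.cos(math.pi * progress))
```

A framework scheduler (e.g., a Transformers `cosine` schedule with warmup) would implement the same shape; this standalone function just makes the schedule's behavior explicit.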