AdaDecode: Accelerating LLM Decoding with Adaptive Layer Parallelism
Authors: Zhepei Wei, Wei-Lin Chen, Xinyu Zhu, Yu Meng
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments across diverse generation tasks show that AdaDecode consistently achieves superior decoding throughput compared to baselines, with up to 1.73× speedup, while guaranteeing output parity with standard autoregressive decoding. |
| Researcher Affiliation | Academia | 1University of Virginia. Correspondence to: Zhepei Wei <EMAIL>, Wei-Lin Chen <EMAIL>, Xinyu Zhu <EMAIL>, Yu Meng <EMAIL>. |
| Pseudocode | Yes | Algorithm 1: LLM Decoding Acceleration via Adaptive Layer Parallelism (AdaDecode) |
| Open Source Code | Yes | Code and artifacts are available at https://github.com/weizhepei/AdaDecode. |
| Open Datasets | Yes | We evaluate our method on a diverse set of text generation tasks, including text summarization (i.e., XSum (Narayan et al., 2018)), code generation (i.e., HumanEval (Chen et al., 2021)), and mathematical reasoning (i.e., GSM8K (Cobbe et al., 2021)), covering a broad spectrum of language model capabilities. |
| Dataset Splits | Yes | For text summarization, we use the widely adopted extreme summarization (XSum) dataset (Narayan et al., 2018), where the models are prompted to produce a single-sentence summary of a news article, testing their ability to identify and precisely summarize the most salient information in a coherent sentence. Following previous works (Zhang et al., 2024a), we randomly sample 1K instances from the test split for evaluation, and 10K instances from the training split for training the lightweight LM head. For code generation, we evaluate our method on the HumanEval (Chen et al., 2021) benchmark, which assesses Python programming skills through a variety of coding problems, ranging from basic tasks to complex problem-solving challenges. Since the standard HumanEval benchmark does not provide a training set, we use the entire MBPP (Austin et al., 2021) dataset for training, which contains a set of crowd-sourced Python programming problems designed to be solvable by entry-level programmers, covering programming fundamentals and standard library functionality. This results in a total of 974 training samples and 164 test samples for this task. Mathematical reasoning: We use GSM8K (Cobbe et al., 2021) as the benchmark for mathematical reasoning, which contains diverse grade-school math word problems created by human problem writers. The dataset consists of 7.5K training problems and 1K test problems. |
| Hardware Specification | Yes | The lightweight LM heads in our method are trained through full-parameter fine-tuning using the alignment-handbook repository3 with 8 Nvidia A100 GPUs. |
| Software Dependencies | No | The framework is implemented using the Hugging Face Transformers library,4 and we set the sampling temperature to zero in all methods for a reproducible comparison. Following Zhang et al. (2024a), the maximum number of new tokens is set to 512. |
| Experiment Setup | Yes | By default, our models are trained using the Adam optimizer (Kingma & Ba, 2014) for 100 epochs, with a batch size of 128, a learning rate of 5e-3, and a cosine learning rate schedule with 3% warmup steps. The framework is implemented using the Hugging Face Transformers library,4 and we set the sampling temperature to zero in all methods for a reproducible comparison. Following Zhang et al. (2024a), the maximum number of new tokens is set to 512. The threshold γ in Equation (1) is set to 0.75, and to ensure a balance between the early prediction rate and the rejection rate, we limit the maximum number of early predictions to 5. |
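The Experiment Setup row specifies the optimizer schedule (Adam, learning rate 5e-3, cosine decay, 3% warmup). As a rough illustration of that schedule, the sketch below computes the per-step learning rate; the function name, step granularity, and warmup handling are assumptions for illustration, not details from the AdaDecode codebase.

```python
import math

def cosine_lr_with_warmup(step, total_steps, base_lr=5e-3, warmup_frac=0.03):
    """Illustrative schedule: linear warmup over the first 3% of steps,
    then cosine decay of the learning rate toward zero.

    Values of base_lr and warmup_frac follow the paper's reported setup;
    everything else here is an assumed implementation detail.
    """
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        # Linear ramp from base_lr/warmup_steps up to base_lr.
        return base_lr * (step + 1) / warmup_steps
    # Cosine decay over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```

A real training run would pass a function like this to an optimizer's LR scheduler (e.g., `torch.optim.lr_scheduler.LambdaLR` with the ratio to `base_lr`), rather than computing rates by hand.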