PEARL: Parallel Speculative Decoding with Adaptive Draft Length
Authors: Tianyu Liu, Yun Li, Qitan Lv, Kai Liu, Jianchen Zhu, Winston Hu, Xiao Sun
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on various text generation benchmarks demonstrate the effectiveness of our PEARL, leading to a superior speedup of up to 4.43× and 1.50×, compared to auto-regressive decoding and vanilla speculative decoding, respectively. Our code is available at https://github.com/smart-lty/ParallelSpeculativeDecoding. |
| Researcher Affiliation | Collaboration | Tianyu Liu1,2 Yun Li3 Qitan Lv1 Kai Liu3 Jianchen Zhu3 Winston Hu3 Xiao Sun2 1University of Science and Technology of China 2Shanghai AI Laboratory 3Tencent |
| Pseudocode | Yes | Algorithm 1 Parallel Speculative Decoding with Adaptive Draft Length. |
| Open Source Code | Yes | Our code is available at https://github.com/smart-lty/ParallelSpeculativeDecoding. |
| Open Datasets | Yes | We conduct experiments on various text generation tasks to evaluate the effectiveness of our PEARL, including HumanEval (code generation task) (Chen et al., 2021), GSM8K & MGSM (multilingual arithmetic reasoning task; MGSM is the multilingual translation of GSM8K) (Cobbe et al., 2021; Shi et al.), and MT-bench (multi-round dialogue task) (Zheng et al., 2024). |
| Dataset Splits | No | For the code generation task, we employ HumanEval (Chen et al., 2021), a famous code generation benchmark composed of 164 entries. For arithmetic reasoning and multilingual inference, we employ GSM8K and MGSM (Cobbe et al., 2021; Shi et al.) as the evaluation benchmarks. As GSM8K is the English version of MGSM, we report their results in the same table. For GSM8K, we sample the first 100 entries for evaluation. For the other 10 categories in MGSM, we select 10 entries for each language. For multi-round dialogue, we employ MT-bench (Zheng et al., 2024) as the benchmark. The maximum generation lengths of these tasks are respectively set to 1024, 256, 256, and 256. |
| Hardware Specification | Yes | All of our experiments including latency measurement, ablation studies, and case studies are conducted on NVIDIA A100-SXM4-80G GPUs. |
| Software Dependencies | Yes | Their implementation requires transformers version 4.36.2, while Llama 3.1 requires transformers 4.43.0. |
| Experiment Setup | Yes | In our experiments, all models are loaded in the precision of bfloat-16. Our PEARL does not introduce any additional training, and directly uses these models to evaluate our algorithm. For inference, we use batch size 1, which is commonly used in other speculative decoding works. The maximum generation lengths of these tasks are respectively set to 1024, 256, 256, and 256. |
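The reported inference settings (bfloat16 precision, batch size 1, per-task maximum generation lengths) can be collected into a small configuration sketch. This is illustrative only: the dictionary keys and the `generation_config` helper are our own naming, not from the paper, and the greedy-decoding flag is an assumption not stated in the quoted setup.

```python
# Hedged sketch of the reported evaluation setup. Only the numeric values
# (max generation lengths of 1024/256/256/256, batch size 1, bfloat16)
# come from the paper; all names here are hypothetical.

# Maximum generation length per benchmark, as quoted in the table above.
MAX_GEN_LENGTHS = {
    "humaneval": 1024,  # code generation
    "gsm8k": 256,       # arithmetic reasoning
    "mgsm": 256,        # multilingual arithmetic reasoning
    "mt-bench": 256,    # multi-round dialogue
}

BATCH_SIZE = 1  # "we use batch size 1", standard in speculative decoding work


def generation_config(task: str) -> dict:
    """Return generate() kwargs matching the reported setup (illustrative)."""
    return {
        "max_new_tokens": MAX_GEN_LENGTHS[task],
        "do_sample": False,  # assumption: greedy decoding is not stated above
    }


# Possible usage with HuggingFace transformers (not executed here):
#   model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.bfloat16)
#   outputs = model.generate(**inputs, **generation_config("humaneval"))
```

Since PEARL is training-free, no fine-tuning configuration is needed; the sketch covers only the inference-time knobs the paper reports.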