PEARL: Parallel Speculative Decoding with Adaptive Draft Length
Authors: Tianyu Liu, Yun Li, Qitan Lv, Kai Liu, Jianchen Zhu, Winston Hu, Xiao Sun
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on various text generation benchmarks demonstrate the effectiveness of our PEARL, leading to a superior speedup of up to 4.43× and 1.50×, compared to auto-regressive decoding and vanilla speculative decoding, respectively. Our code is available at https://github.com/smart-lty/ParallelSpeculativeDecoding. |
| Researcher Affiliation | Collaboration | Tianyu Liu1,2 Yun Li3 Qitan Lv1 Kai Liu3 Jianchen Zhu3 Winston Hu3 Xiao Sun2 1University of Science and Technology of China 2Shanghai AI Laboratory 3Tencent |
| Pseudocode | Yes | Algorithm 1 Parallel Speculative Decoding with Adaptive Draft Length. |
| Open Source Code | Yes | Our code is available at https://github.com/smart-lty/ParallelSpeculativeDecoding. |
| Open Datasets | Yes | We conduct experiments on various text generation tasks to evaluate the effectiveness of our PEARL, including HumanEval (code generation task) (Chen et al., 2021), GSM8K & MGSM (multilingual arithmetic reasoning task; MGSM is the multilingual translation of GSM8K) (Cobbe et al., 2021; Shi et al.), and MT-bench (multi-round dialogue task) (Zheng et al., 2024). |
| Dataset Splits | No | For the code generation task, we employ HumanEval (Chen et al., 2021), a famous code generation benchmark composed of 164 entries. For arithmetic reasoning and multilingual inference, we employ GSM8K and MGSM (Cobbe et al., 2021; Shi et al.) as the evaluation benchmarks. As GSM8K is the English version of MGSM, we report their results in the same table. For GSM8K, we sample the first 100 entries for evaluation. For the other 10 categories in MGSM, we select 10 entries for each language. For multi-round dialogue, we employ MT-bench (Zheng et al., 2024) as the benchmark. The maximum generation lengths of these tasks are respectively set to 1024, 256, 256, and 256. |
| Hardware Specification | Yes | All of our experiments including latency measurement, ablation studies, and case studies are conducted on NVIDIA A100-SXM4-80G GPUs. |
| Software Dependencies | Yes | Their implementation requires transformers version 4.36.2, while Llama 3.1 requires transformers 4.43.0. |
| Experiment Setup | Yes | In our experiments, all models are loaded in the precision of bfloat-16. Our PEARL does not introduce any additional training, and directly uses these models to evaluate our algorithm. For inference, we use batch size 1, which is commonly used in other speculative decoding works. The maximum generation lengths of these tasks are respectively set to 1024, 256, 256, and 256. |
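The reported inference settings (bfloat16 precision, batch size 1, per-task maximum generation lengths) can be collected into a small configuration sketch. This is illustrative only: the dictionary keys and the `generation_config` helper are our own naming, not from the paper, and the greedy-decoding flag is an assumption not stated in the quoted setup.

```python
# Hedged sketch of the reported evaluation setup. Only the numeric values
# (max generation lengths of 1024/256/256/256, batch size 1, bfloat16)
# come from the paper; all names here are hypothetical.

# Maximum generation length per benchmark, as quoted in the table above.
MAX_GEN_LENGTHS = {
    "humaneval": 1024,  # code generation
    "gsm8k": 256,       # arithmetic reasoning
    "mgsm": 256,        # multilingual arithmetic reasoning
    "mt-bench": 256,    # multi-round dialogue
}

BATCH_SIZE = 1  # "we use batch size 1", standard in speculative decoding work


def generation_config(task: str) -> dict:
    """Return generate() kwargs matching the reported setup (illustrative)."""
    return {
        "max_new_tokens": MAX_GEN_LENGTHS[task],
        "do_sample": False,  # assumption: greedy decoding is not stated above
    }


# Possible usage with HuggingFace transformers (not executed here):
#   model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.bfloat16)
#   outputs = model.generate(**inputs, **generation_config("humaneval"))
```

Since PEARL is training-free, no fine-tuning configuration is needed; the sketch covers only the inference-time knobs the paper reports.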