Fast Large Language Model Collaborative Decoding via Speculation

Authors: Jiale Fu, Yuchu Jiang, Junkai Chen, Jiaming Fan, Xin Geng, Xu Yang

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Extensive experiments demonstrate CoS is 1.11x to 2.23x faster than standard collaborative decoding without compromising generation quality. ... Experimentally, we conduct extensive experiments across various tasks, including code generation, mathematical reasoning, multi-task understanding, and text summarization. Our evaluation covers multiple LLM pairs, including Llama, Vicuna, and Qwen series, under both two-model and three-model configurations."
Researcher Affiliation | Academia | "(1) Southeast University; (2) Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications (Southeast University), Ministry of Education, China. Correspondence to: Xu Yang <xuyang EMAIL>."
Pseudocode | Yes | "The pseudocode for this framework is provided in Algorithm 1. ... The corresponding pseudocode is provided in Algorithm 2."
Open Source Code | Yes | "Our code is available at https://github.com/Kamichanw/CoS/."
Open Datasets | Yes | "We test CoS across multiple tasks including code generation, mathematical reasoning, multitask understanding, and text summarization on HumanEval (Chen et al., 2021), GSM8K (Cobbe et al., 2021), MMLU (Hendrycks et al., 2021), and CNNDM (See et al., 2017), respectively."
Dataset Splits | No | The paper mentions using well-known datasets such as HumanEval, GSM8K, MMLU, and CNNDM, but it neither details how these datasets were split into training, validation, or test sets nor cites specific predefined splits for reproducibility.
Hardware Specification | Yes | "All experiments are conducted on RTX 3090, except for evaluations involving the Llama-Vicuna model pair, which use the A6000 GPU. Additionally, we also test on the Ascend 910B3 NPU; the corresponding results are shown in Table 9 and Table 10."
Software Dependencies | No | The paper does not specify ancillary software with version numbers, such as the programming languages, libraries, or frameworks (e.g., Python, PyTorch, TensorFlow, CUDA versions) used for the implementation.
Experiment Setup | Yes | "For WE, in the two-model case, we set λ = 0.5 and temperature T = 1; in the three-model case, each model's coefficient was set to 1/3. For CD, we set µ = 0.1, which is the most common setting, and set T to both 0 and 1. ... We tested γ = 5 and γ = 1 for CoS and SD speeds, reporting the optimal results. ... In the three-model CoS, all models have a proposal length of 1."
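The WE (weighted ensemble) setting quoted above, λ = 0.5 with T = 1, amounts to a convex combination of the two models' temperature-scaled next-token distributions. The following is a minimal sketch of that combination only; the function names and toy logits are illustrative and not taken from the paper's code.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits (max-shifted for stability)."""
    m = max(logits)
    exps = [math.exp((x - m) / temperature) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_ensemble(logits_a, logits_b, lam=0.5, temperature=1.0):
    """Mix two models' next-token distributions: p = lam * p_a + (1 - lam) * p_b.

    Sketch of the WE baseline with the reported two-model setting
    (lam = 0.5, T = 1); in the three-model case each coefficient would be 1/3.
    """
    p_a = softmax(logits_a, temperature)
    p_b = softmax(logits_b, temperature)
    return [lam * a + (1.0 - lam) * b for a, b in zip(p_a, p_b)]

# Toy two-model example with the paper's reported setting (λ = 0.5, T = 1):
p = weighted_ensemble([2.0, 1.0, 0.1], [0.5, 2.5, 0.2], lam=0.5, temperature=1.0)
```

Because each per-model distribution sums to 1 and the coefficients sum to 1, the mixture is itself a valid distribution that can be sampled (or decoded greedily) at each step.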