Fast Large Language Model Collaborative Decoding via Speculation
Authors: Jiale Fu, Yuchu Jiang, Junkai Chen, Jiaming Fan, Xin Geng, Xu Yang
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate CoS is 1.11x-2.23x faster than standard collaborative decoding without compromising generation quality. ... Experimentally, we conduct extensive experiments across various tasks, including code generation, mathematical reasoning, multi-task understanding, and text summarization. Our evaluation covers multiple LLM pairs, including Llama, Vicuna, and Qwen series, under both two-model and three-model configurations. |
| Researcher Affiliation | Academia | 1Southeast University 2Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications (Southeast University), Ministry of Education, China. Correspondence to: Xu Yang <xuyang EMAIL>. |
| Pseudocode | Yes | The pseudocode for this framework is provided in Algorithm 1. ... The corresponding pseudocode is provided in Algorithm 2. |
| Open Source Code | Yes | Our code is available at https://github.com/Kamichanw/CoS/. |
| Open Datasets | Yes | We test CoS across multiple tasks including code generation, mathematical reasoning, multitask understanding, and text summarization on HumanEval (Chen et al., 2021), GSM8K (Cobbe et al., 2021), MMLU (Hendrycks et al., 2021), and CNNDM (See et al., 2017), respectively. |
| Dataset Splits | No | The paper mentions using well-known datasets like HumanEval, GSM8K, MMLU, and CNNDM but does not provide specific details on how these datasets were split into training, validation, or test sets, nor does it refer to specific predefined splits with citations for reproducibility. |
| Hardware Specification | Yes | All experiments are conducted on RTX 3090, except for evaluations involving the Llama-Vicuna model pair, which use the A6000 GPU. Additionally, we also test on the Ascend 910B3 NPU; the corresponding results are shown in Table 9 and Table 10. |
| Software Dependencies | No | The paper does not provide specific ancillary software details with version numbers, such as programming languages, libraries, or frameworks (e.g., Python, PyTorch, TensorFlow, CUDA versions) used for the implementation. |
| Experiment Setup | Yes | For WE, in the two-model case, we set λ = 0.5 and temperature T = 1; in the three-model case, each model's coefficient was set to 1/3. For CD, we set µ = 0.1, which is the most common setting, and set T to both 0 and 1. ... We tested γ = 5 and γ = 1 for CoS and SD speeds, reporting the optimal results. ... In the three-model CoS, all models have a proposal length of 1. |
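The WE (weighted ensemble) baseline referenced in the setup above combines the next-token distributions of collaborating models with fixed coefficients. A minimal sketch of the two-model case with the reported λ = 0.5 and T = 1 follows; the function names and toy logits are illustrative assumptions, not taken from the paper's released code:

```python
import math

def softmax(logits, temperature=1.0):
    # Temperature-scaled softmax over a list of raw logits.
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_ensemble(logits_a, logits_b, lam=0.5, temperature=1.0):
    """Two-model WE: p = lam * p_a + (1 - lam) * p_b.

    With lam = 0.5 and temperature = 1 this matches the paper's
    reported two-model setting; in the three-model case each
    distribution would instead receive a coefficient of 1/3.
    """
    p_a = softmax(logits_a, temperature)
    p_b = softmax(logits_b, temperature)
    return [lam * a + (1 - lam) * b for a, b in zip(p_a, p_b)]

# Hypothetical logits over a toy 4-token vocabulary.
p = weighted_ensemble([2.0, 1.0, 0.5, -1.0], [1.5, 1.2, 0.3, -0.5])
```

The ensembled `p` is a valid probability distribution (it sums to 1), from which the next token is sampled or taken greedily, depending on the decoding mode.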