Fast Large Language Model Collaborative Decoding via Speculation

Authors: Jiale Fu, Yuchu Jiang, Junkai Chen, Jiaming Fan, Xin Geng, Xu Yang

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Extensive experiments demonstrate CoS is 1.11x to 2.23x faster than standard collaborative decoding without compromising generation quality. ... Experimentally, we conduct extensive experiments across various tasks, including code generation, mathematical reasoning, multi-task understanding, and text summarization. Our evaluation covers multiple LLM pairs, including Llama, Vicuna, and Qwen series, under both two-model and three-model configurations."
Researcher Affiliation | Academia | "(1) Southeast University; (2) Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications (Southeast University), Ministry of Education, China. Correspondence to: Xu Yang <xuyang EMAIL>."
Pseudocode | Yes | "The pseudocode for this framework is provided in Algorithm 1. ... The corresponding pseudocode is provided in Algorithm 2."
Open Source Code | Yes | "Our code is available at https://github.com/Kamichanw/CoS/."
Open Datasets | Yes | "We test CoS across multiple tasks including code generation, mathematical reasoning, multitask understanding, and text summarization on HumanEval (Chen et al., 2021), GSM8K (Cobbe et al., 2021), MMLU (Hendrycks et al., 2021), and CNNDM (See et al., 2017), respectively."
Dataset Splits | No | The paper mentions using well-known datasets such as HumanEval, GSM8K, MMLU, and CNNDM, but it neither details how these datasets were split into training, validation, or test sets nor cites specific predefined splits for reproducibility.
Hardware Specification | Yes | "All experiments are conducted on RTX 3090, except for evaluations involving the Llama-Vicuna model pair, which use the A6000 GPU. Additionally, we also test on the Ascend 910B3 NPU; the corresponding results are shown in Table 9 and Table 10."
Software Dependencies | No | The paper does not specify ancillary software with version numbers, such as the programming languages, libraries, or frameworks (e.g., Python, PyTorch, TensorFlow, CUDA versions) used for the implementation.
Experiment Setup | Yes | "For WE, in the two-model case, we set λ = 0.5 and temperature T = 1; in the three-model case, each model's coefficient was set to 1/3. For CD, we set µ = 0.1, which is the most common setting, and set T to both 0 and 1. ... We tested γ = 5 and γ = 1 for CoS and SD speeds, reporting the optimal results. ... In the three-model CoS, all models have a proposal length of 1."
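The WE (weighted ensemble) setting quoted above, λ = 0.5 with T = 1, amounts to a convex combination of the two models' temperature-scaled next-token distributions. The following is a minimal sketch of that combination only; the function names and toy logits are illustrative and not taken from the paper's code.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits (max-shifted for stability)."""
    m = max(logits)
    exps = [math.exp((x - m) / temperature) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_ensemble(logits_a, logits_b, lam=0.5, temperature=1.0):
    """Mix two models' next-token distributions: p = lam * p_a + (1 - lam) * p_b.

    Sketch of the WE baseline with the reported two-model setting
    (lam = 0.5, T = 1); in the three-model case each coefficient would be 1/3.
    """
    p_a = softmax(logits_a, temperature)
    p_b = softmax(logits_b, temperature)
    return [lam * a + (1.0 - lam) * b for a, b in zip(p_a, p_b)]

# Toy two-model example with the paper's reported setting (λ = 0.5, T = 1):
p = weighted_ensemble([2.0, 1.0, 0.1], [0.5, 2.5, 0.2], lam=0.5, temperature=1.0)
```

Because each per-model distribution sums to 1 and the coefficients sum to 1, the mixture is itself a valid distribution that can be sampled (or decoded greedily) at each step.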