Speculate, then Collaborate: Fusing Knowledge of Language Models during Decoding
Authors: Ziyao Wang, Muneeza Azmat, Ang Li, Raya Horesh, Mikhail Yurochkin
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results demonstrate that CoSD improves accuracy by up to 10% across benchmarks compared to existing methods, providing a scalable and effective solution for LLM-based applications. |
| Researcher Affiliation | Collaboration | 1Department of Electrical and Computer Engineering, University of Maryland, College Park, USA, 2IBM Research, Yorktown Heights, USA. |
| Pseudocode | Yes | Algorithm 1 Workflow of COSD |
| Open Source Code | Yes | Our code has been released at https://github.com/ATP-1010/CoSD. |
| Open Datasets | Yes | For all the scenarios and model pairs, we use MMLU (Hendrycks et al., 2020), GSM8K (Cobbe et al., 2021), HumanEval (Chen et al., 2021), HellaSwag (Zellers et al., 2019), and TruthfulQA (Lin et al., 2021) as the evaluation benchmarks. |
| Dataset Splits | No | The paper refers to using 'tinyBenchmarks (Polo et al., 2024)' for evaluation and 'randomly select three samples from the AlpacaEval dataset' for training the decision tree, but does not provide explicit details on the dataset splits (e.g., percentages, exact counts) for the main experimental benchmarks in the main text. |
| Hardware Specification | No | The paper mentions "Token latency represents the averaged time to generate one token" and discusses efficiency, but does not specify any particular hardware (GPU/CPU models, specific processors, or detailed cloud instances) used for these experiments. |
| Software Dependencies | No | The paper mentions various language models and refers to the Hugging Face repository, but does not provide specific version numbers for any software dependencies (e.g., Python, PyTorch, or other libraries) used in their experimental setup. |
| Experiment Setup | Yes | For Rule-Based COSD, we set α = 0.5 and β = 0.5, which were determined to be the optimal and most transferable parameters based on our analysis in Figure 2. For Tree-Based COSD, we randomly select three samples from the AlpacaEval dataset to train the decision tree. |
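The setup row only reports the thresholds α = 0.5 and β = 0.5, not the fusion rule itself. The sketch below is a hypothetical illustration of how such a rule-based token-fusion step *could* use two thresholds: keep the draft model's token when the draft is confident enough (probability ≥ α), otherwise switch to the assist model's top token when its probability advantage over the draft token exceeds β. The function name `fuse_step` and this exact criterion are assumptions for illustration, not the paper's verified algorithm.

```python
# Hypothetical rule-based fusion step (illustrative; the paper's exact
# criterion may differ). Each probs argument maps token -> probability
# for a single decoding position.

def fuse_step(draft_probs, assist_probs, alpha=0.5, beta=0.5):
    draft_tok = max(draft_probs, key=draft_probs.get)
    if draft_probs[draft_tok] >= alpha:
        return draft_tok  # draft model is confident: accept its token
    assist_tok = max(assist_probs, key=assist_probs.get)
    # switch only if the assist model clearly prefers a different token
    if assist_probs[assist_tok] - assist_probs.get(draft_tok, 0.0) >= beta:
        return assist_tok
    return draft_tok
```

With α = β = 0.5 as in the reported setup, a confident draft (e.g., probability 0.9 on its top token) is kept, while an unconfident draft is overridden only when the assist model strongly disagrees.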