Speculate, then Collaborate: Fusing Knowledge of Language Models during Decoding

Authors: Ziyao Wang, Muneeza Azmat, Ang Li, Raya Horesh, Mikhail Yurochkin

ICML 2025

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results demonstrate that CoSD improves accuracy by up to 10% across benchmarks compared to existing methods, providing a scalable and effective solution for LLM-based applications. |
| Researcher Affiliation | Collaboration | (1) Department of Electrical and Computer Engineering, University of Maryland, College Park, USA; (2) IBM Research, Yorktown Heights, USA. |
| Pseudocode | Yes | Algorithm 1: Workflow of CoSD. |
| Open Source Code | Yes | Our code has been released at https://github.com/ATP-1010/CoSD. |
| Open Datasets | Yes | For all scenarios and model pairs, we use MMLU (Hendrycks et al., 2020), GSM8K (Cobbe et al., 2021), HumanEval (Chen et al., 2021), HellaSwag (Zellers et al., 2019), and TruthfulQA (Lin et al., 2021) as the evaluation benchmarks. |
| Dataset Splits | No | The paper refers to using tinyBenchmarks (Polo et al., 2024) for evaluation and to randomly selecting three samples from the AlpacaEval dataset to train the decision tree, but the main text does not give explicit dataset splits (e.g., percentages or exact counts) for the main experimental benchmarks. |
| Hardware Specification | No | The paper states that "token latency represents the averaged time to generate one token" and discusses efficiency, but does not specify the hardware (GPU/CPU models, specific processors, or detailed cloud instances) used for these experiments. |
| Software Dependencies | No | The paper mentions various language models and refers to the Hugging Face repository, but does not provide version numbers for any software dependencies (e.g., Python, PyTorch, or other libraries) used in the experimental setup. |
| Experiment Setup | Yes | For Rule-Based CoSD, we set α = 0.5 and β = 0.5, determined to be the optimal and most transferable parameters based on our analysis in Figure 2. For Tree-Based CoSD, we randomly select three samples from the AlpacaEval dataset to train the decision tree. |