Speculate, then Collaborate: Fusing Knowledge of Language Models during Decoding

Authors: Ziyao Wang, Muneeza Azmat, Ang Li, Raya Horesh, Mikhail Yurochkin

ICML 2025

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results demonstrate that CoSD improves accuracy by up to 10% across benchmarks compared to existing methods, providing a scalable and effective solution for LLM-based applications. |
| Researcher Affiliation | Collaboration | (1) Department of Electrical and Computer Engineering, University of Maryland, College Park, USA; (2) IBM Research, Yorktown Heights, USA. |
| Pseudocode | Yes | Algorithm 1: Workflow of CoSD. |
| Open Source Code | Yes | Our code has been released at https://github.com/ATP-1010/CoSD. |
| Open Datasets | Yes | For all scenarios and model pairs, we use MMLU (Hendrycks et al., 2020), GSM8K (Cobbe et al., 2021), HumanEval (Chen et al., 2021), HellaSwag (Zellers et al., 2019), and TruthfulQA (Lin et al., 2021) as the evaluation benchmarks. |
| Dataset Splits | No | The paper refers to using tinyBenchmarks (Polo et al., 2024) for evaluation and to randomly selecting three samples from the AlpacaEval dataset to train the decision tree, but the main text does not give explicit dataset splits (e.g., percentages or exact counts) for the main experimental benchmarks. |
| Hardware Specification | No | The paper states that "token latency represents the averaged time to generate one token" and discusses efficiency, but does not specify the hardware (GPU/CPU models, specific processors, or detailed cloud instances) used for these experiments. |
| Software Dependencies | No | The paper mentions various language models and refers to the Hugging Face repository, but does not provide version numbers for any software dependencies (e.g., Python, PyTorch, or other libraries) used in the experimental setup. |
| Experiment Setup | Yes | For Rule-Based CoSD, we set α = 0.5 and β = 0.5, determined to be the optimal and most transferable parameters based on our analysis in Figure 2. For Tree-Based CoSD, we randomly select three samples from the AlpacaEval dataset to train the decision tree. |