Accelerating Large Language Model Reasoning via Speculative Search

Authors: Zhihai Wang, Jie Wang, Jilai Pan, Xilin Xia, Huiling Zhen, Mingxuan Yuan, Jianye Hao, Feng Wu

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on both the Qwen and Llama models demonstrate that SpecSearch significantly outperforms state-of-the-art approaches, achieving up to a 2.12× speedup with comparable reasoning quality.
Researcher Affiliation | Collaboration | 1 MoE Key Laboratory of Brain-inspired Intelligent Perception and Cognition, University of Science and Technology of China; 2 Noah's Ark Lab, Huawei Technologies; 3 College of Intelligence and Computing, Tianjin University. Correspondence to: Jie Wang <EMAIL>.
Pseudocode | Yes | The procedure is summarized in Algorithm 1. ... Furthermore, we present the complete SpecSearch algorithm, which builds on beam search, as shown in Algorithm 2.
Open Source Code | Yes | Code is available at https://github.com/MIRALab-USTC/LLMReasoning-SpecSearch.
Open Datasets | Yes | We use two well-established mathematical problem datasets, GSM8K (Cobbe et al., 2021) and MATH (Hendrycks et al., 2021), to evaluate the acceleration performance of the proposed framework.
Dataset Splits | No | We randomly select 100 samples from both the GSM8K and MATH datasets for evaluation. ... we select 50 mathematical problems from the MATH dataset as the test set for the ablation study. This indicates random sampling without a fixed seed or documented procedure for exactly reproducing the splits used.
Hardware Specification | No | No specific hardware details (GPU/CPU models, memory, or the exact computing environment) are provided. The paper only names the models used: 'We use quantized Qwen2.5-72B-Instruct and Qwen2.5-7B-Instruct (Team, 2024) as large and small models, respectively, along with quantized Llama3-70B-Instruct and Llama3-8B-Instruct (Dubey et al., 2024).'
Software Dependencies | No | The paper mentions using the vLLM (Kwon et al., 2023) package and OpenR but does not provide version numbers for these or any other software dependencies.
Experiment Setup | Yes | Unless stated otherwise, experiments follow OpenR (Wang et al., 2024a) settings: tree width of 6, tree depth of 50, MATH-psa as the process reward model (PRM), Qwen models as the main LLMs, and beam search as the main search algorithm. Throughout all experiments, we set the EMA weight θ in SpecSearch to 0.9.
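The Dataset Splits gap above is what makes the evaluation subsets irreproducible: drawing 100 samples at random without a recorded seed yields a different split on every run. A minimal sketch of what a reproducible selection would look like (the function name and seed value are hypothetical; the paper reports neither):

```python
import random

def select_eval_subset(dataset_ids, n_samples, seed=0):
    """Draw a deterministic evaluation subset from a list of problem IDs.

    The seed is illustrative only; the paper does not report one.
    """
    rng = random.Random(seed)  # isolated RNG, leaves global random state untouched
    return sorted(rng.sample(dataset_ids, n_samples))

# Example: pick 100 of GSM8K's 1319 test problems deterministically.
subset = select_eval_subset(list(range(1319)), 100, seed=0)
assert subset == select_eval_subset(list(range(1319)), 100, seed=0)  # same seed → same split
```

Publishing the seed (or the selected problem IDs themselves) would let the 100-sample and 50-sample subsets be reconstructed exactly.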
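The EMA weight θ = 0.9 in the Experiment Setup row controls how heavily SpecSearch's running estimate favors history over new observations. The update below is the standard exponential-moving-average form, shown only to clarify what θ = 0.9 means; how the paper applies the estimate inside the search is not reproduced here:

```python
def ema_update(prev, new_value, theta=0.9):
    """Exponential moving average: keep a theta fraction of the old
    estimate and blend in (1 - theta) of the new observation."""
    return theta * prev + (1.0 - theta) * new_value

# With theta = 0.9, the estimate adapts slowly: starting at 0.5 and
# repeatedly observing 0.8 moves it to 0.53, then 0.557, then 0.5813.
estimate = 0.5
for reward in [0.8, 0.8, 0.8]:
    estimate = ema_update(estimate, reward)
```

A large θ smooths out noisy per-step reward signals at the cost of slower adaptation, which is why EMA weights near 0.9 are a common default.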