Sample, Scrutinize and Scale: Effective Inference-Time Search by Scaling Verification

Authors: Eric Zhao, Pranjal Awasthi, Sreenivas Gollapudi

ICML 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Among our findings is that simply scaling up a minimalist implementation of sampling-based search, using only random sampling and direct self-verification, provides a practical inference method that, for example, elevates the reasoning capabilities of Gemini v1.5 Pro above that of o1-Preview on popular benchmarks. We partially attribute the scalability of sampling-based search to a phenomenon of implicit scaling, where sampling a larger pool of responses in turn improves self-verification accuracy. We further identify two useful principles for improving self-verification capabilities with test-time compute: (1) comparing across responses provides helpful signals about the locations of errors and hallucinations, and (2) different model output styles are useful for different contexts: chains of thought are useful for reasoning but harder to verify. We also find that, though accurate verification can be elicited, frontier models demonstrate remarkably weak out-of-box verification capabilities and introduce a benchmark to measure progress on these deficiencies. Table 1 summarizes our first finding: that, with effective self-verification, simply scaling sampling-based search is sufficient to approach state-of-the-art performance on reasoning and math benchmarks (AIME 2024 (MAA, 2024), Live Bench Math, Live Bench Reasoning (White et al., 2024), and the Berkeley MATH dataset (Hendrycks et al., 2021)).
Researcher Affiliation Collaboration Eric Zhao (Google Research, UC Berkeley), Pranjal Awasthi (Google Research), Sreenivas Gollapudi (Google Research). Correspondence to: Eric Zhao <EMAIL>.
Pseudocode Yes Algorithm 1 Sampling-Based Search (Verification@k_inf). Input: prompt Q, model LM, parameters k_inf, k_verif, k_tie. Populate S with k_inf samples from LM("Answer Q"). For each candidate response s_i in S, let V_i be k_verif samples of LM("Is s_i correct?"), each scored 1 if judged correct. Gather the highest-scored responses S_Best = {s_i : Avg(V_i) >= max_{j in [k_inf]} Avg(V_j) - 0.05}. Return the response in S_Best if |S_Best| = 1. Otherwise, for each pair (s_i, s_j) in S_Best^2, let C_{i,j} be k_tie samples of LM("Is s_i or s_j correct?"). Return the round-robin winner s_i of {C_{i,j} : s_i, s_j in S_Best}.
Open Source Code No The paper does not provide a specific link to source code, nor does it explicitly state that code for the methodology is released. It mentions using Google Cloud with Gemini models, but not the release of their own implementation.
Open Datasets Yes Our MATH benchmark consists of 500 questions from the PRM800K (Lightman et al., 2024) test split of MATH (Hendrycks et al., 2021). Our Live Bench Math benchmark consists of 200 random questions from the 368 available as of October 21st 2024, including AMC12 2023, AIME 2024, SMC 2023, USAMO 2023, IMO 2023, and synthetic math questions (White et al., 2024). Our Live Bench Reasoning benchmark consists of 140 questions from the 150 available as of October 21st 2024, including Zebra puzzles, Web-Of-Lies, and Spatial reasoning (White et al., 2024). Our AIME benchmark consists of the 15 questions in the 2024 Exam II (MAA, 2024).
Dataset Splits Yes Our MATH benchmark consists of 500 questions from the PRM800K (Lightman et al., 2024) test split of MATH (Hendrycks et al., 2021). Our Live Bench Math benchmark consists of 200 random questions from the 368 available as of October 21st 2024, including AMC12 2023, AIME 2024, SMC 2023, USAMO 2023, IMO 2023, and synthetic math questions (White et al., 2024). Our Live Bench Reasoning benchmark consists of 140 questions from the 150 available as of October 21st 2024, including Zebra puzzles, Web-Of-Lies, and Spatial reasoning (White et al., 2024). Our AIME benchmark consists of the 15 questions in the 2024 Exam II (MAA, 2024).
Hardware Specification No The paper states, "All experiments are run on Google Cloud with Gemini v1.5-Pro-002 and Gemini v1.5-Flash-002 models dated to September 2024." This specifies the cloud platform and the language models used, but not the underlying hardware (e.g., specific GPU or CPU models, memory details).
Software Dependencies Yes All experiments are run on Google Cloud with Gemini v1.5-Pro-002 and Gemini v1.5-Flash-002 models dated to September 2024.
Experiment Setup Yes Unless otherwise specified, the default parameters for our implementation of sampling-based search (Section 3) are kinf = 200, σinf = 1.5, kverif = 50, σverif = 1, and a maximum of 8,192 output tokens per query.
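The sampling-based search procedure quoted in the Pseudocode row above can be sketched in Python. This is a minimal illustration, not the authors' implementation: the `generate`, `verify`, and `compare` callables stand in for the language-model queries in Algorithm 1, and their signatures are assumptions made for the sketch.

```python
import random
from collections import Counter

def sampling_based_search(generate, verify, compare,
                          k_inf=200, k_verif=50, k_tie=10, margin=0.05):
    """Sketch of Algorithm 1 (sampling-based search with self-verification).

    generate, verify, and compare are caller-supplied stand-ins for the
    language-model calls (their signatures are assumptions for this sketch):
      generate()    -> one sampled candidate response
      verify(s)     -> 1 if the model judges s correct, else 0
      compare(a, b) -> "a" or "b", whichever the model judges correct
    """
    # Step 1: sample k_inf candidate responses.
    candidates = [generate() for _ in range(k_inf)]

    # Step 2: score each candidate by averaging k_verif verification votes.
    scores = [sum(verify(s) for _ in range(k_verif)) / k_verif
              for s in candidates]

    # Step 3: keep candidates within `margin` of the best verification score.
    best = max(scores)
    finalists = [s for s, sc in zip(candidates, scores) if sc >= best - margin]
    finalists = list(dict.fromkeys(finalists))  # deduplicate, preserve order
    if len(finalists) == 1:
        return finalists[0]

    # Step 4: break remaining ties with round-robin pairwise comparisons.
    wins = Counter()
    for i, a in enumerate(finalists):
        for b in finalists[i + 1:]:
            for _ in range(k_tie):
                wins[a if compare(a, b) == "a" else b] += 1
    return max(finalists, key=lambda s: wins[s])

# Toy demo: a noisy "model" that answers "7" (correct) 60% of the time,
# with 90%-accurate verification; the search should recover "7".
random.seed(0)
generate = lambda: "7" if random.random() < 0.6 else "9"
verify = lambda s: int((s == "7") == (random.random() < 0.9))
compare = lambda a, b: (("a" if a == "7" else "b")
                        if random.random() < 0.9
                        else ("b" if a == "7" else "a"))
answer = sampling_based_search(generate, verify, compare,
                               k_inf=20, k_verif=10, k_tie=5)
print(answer)
```

The toy demo illustrates why verification scales: even a weak generator combined with a moderately reliable verifier selects the correct answer once enough candidates are sampled and scored.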
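For reference, the defaults reported in the Experiment Setup row can be collected in a small config mapping. Interpreting σinf and σverif as sampling temperatures is an assumption for this sketch; the quoted text gives only the symbols and values.

```python
# Reported defaults from the paper's experiment setup; reading sigma_inf and
# sigma_verif as sampling temperatures is an assumption, not stated in the row.
DEFAULTS = {
    "k_inf": 200,              # candidate responses sampled per question
    "sigma_inf": 1.5,          # temperature when sampling candidates (assumed)
    "k_verif": 50,             # verification samples per candidate
    "sigma_verif": 1.0,        # temperature when sampling verifications (assumed)
    "max_output_tokens": 8192, # cap on output tokens per query
}
print(DEFAULTS)
```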