Sample, Scrutinize and Scale: Effective Inference-Time Search by Scaling Verification

Authors: Eric Zhao, Pranjal Awasthi, Sreenivas Gollapudi

ICML 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Among our findings is that simply scaling up a minimalist implementation of sampling-based search, using only random sampling and direct self-verification, provides a practical inference method that, for example, elevates the reasoning capabilities of Gemini v1.5 Pro above that of o1-Preview on popular benchmarks. We partially attribute the scalability of sampling-based search to a phenomenon of implicit scaling, where sampling a larger pool of responses in turn improves self-verification accuracy. We further identify two useful principles for improving self-verification capabilities with test-time compute: (1) comparing across responses provides helpful signals about the locations of errors and hallucinations, and (2) different model output styles are useful for different contexts: chains of thought are useful for reasoning but harder to verify. We also find that, though accurate verification can be elicited, frontier models demonstrate remarkably weak out-of-box verification capabilities and introduce a benchmark to measure progress on these deficiencies. Table 1 summarizes our first finding: that, with effective self-verification, simply scaling sampling-based search is sufficient to approach state-of-the-art performance on reasoning and math benchmarks (AIME 2024 (MAA, 2024), Live Bench Math, Live Bench Reasoning (White et al., 2024), and the Berkeley MATH dataset (Hendrycks et al., 2021)).
Researcher Affiliation Collaboration Eric Zhao (Google Research, UC Berkeley), Pranjal Awasthi (Google Research), Sreenivas Gollapudi (Google Research). Correspondence to: Eric Zhao <EMAIL>.
Pseudocode Yes Algorithm 1 Sampling-Based Search (Verification@k_inf). Input: prompt Q, model LM, parameters k_inf, k_verif, k_tie. Populate S with k_inf samples from LM("Answer Q"). For each candidate response s_i in S, let V_i be k_verif samples of LM("Is s_i correct?"), each scored 1 if judged correct. Gather the highest-scored responses S_Best = {s_i : Avg(V_i) >= max_{j in [k_inf]} Avg(V_j) - 0.05}. Return the response in S_Best if |S_Best| = 1. Otherwise, for each pair (s_i, s_j) in S_Best^2, let C_{i,j} be k_tie samples of LM("Is s_i or s_j correct?"). Return the round-robin winner s_i of {C_{i,j} : s_i, s_j in S_Best}.
Open Source Code No The paper does not provide a specific link to source code, nor does it explicitly state that code for the methodology is released. It mentions using Google Cloud with Gemini models, but not the release of their own implementation.
Open Datasets Yes Our MATH benchmark consists of 500 questions from the PRM800K (Lightman et al., 2024) test split of MATH (Hendrycks et al., 2021). Our Live Bench Math benchmark consists of 200 random questions from the 368 available as of October 21st 2024, including AMC12 2023, AIME 2024, SMC 2023, USAMO 2023, IMO 2023, and synthetic math questions (White et al., 2024). Our Live Bench Reasoning benchmark consists of 140 questions from the 150 available as of October 21st 2024, including Zebra puzzles, Web-Of-Lies, and Spatial reasoning (White et al., 2024). Our AIME benchmark consists of the 15 questions in the 2024 Exam II (MAA, 2024).
Dataset Splits Yes Our MATH benchmark consists of 500 questions from the PRM800K (Lightman et al., 2024) test split of MATH (Hendrycks et al., 2021). Our Live Bench Math benchmark consists of 200 random questions from the 368 available as of October 21st 2024, including AMC12 2023, AIME 2024, SMC 2023, USAMO 2023, IMO 2023, and synthetic math questions (White et al., 2024). Our Live Bench Reasoning benchmark consists of 140 questions from the 150 available as of October 21st 2024, including Zebra puzzles, Web-Of-Lies, and Spatial reasoning (White et al., 2024). Our AIME benchmark consists of the 15 questions in the 2024 Exam II (MAA, 2024).
Hardware Specification No The paper states, "All experiments are run on Google Cloud with Gemini v1.5-Pro-002 and Gemini v1.5-Flash-002 models dated to September 2024." This specifies the cloud platform and the language models used, but not the underlying hardware (e.g., specific GPU or CPU models, memory details).
Software Dependencies Yes All experiments are run on Google Cloud with Gemini v1.5-Pro-002 and Gemini v1.5-Flash-002 models dated to September 2024.
Experiment Setup Yes Unless otherwise specified, the default parameters for our implementation of sampling-based search (Section 3) are kinf = 200, σinf = 1.5, kverif = 50, σverif = 1, and a maximum of 8,192 output tokens per query.
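The sampling-based search procedure quoted in the Pseudocode row above can be sketched in Python. This is a minimal illustration, not the authors' implementation: the `generate`, `verify`, and `compare` callables stand in for the language-model queries in Algorithm 1, and their signatures are assumptions made for the sketch.

```python
import random
from collections import Counter

def sampling_based_search(generate, verify, compare,
                          k_inf=200, k_verif=50, k_tie=10, margin=0.05):
    """Sketch of Algorithm 1 (sampling-based search with self-verification).

    generate, verify, and compare are caller-supplied stand-ins for the
    language-model calls (their signatures are assumptions for this sketch):
      generate()    -> one sampled candidate response
      verify(s)     -> 1 if the model judges s correct, else 0
      compare(a, b) -> "a" or "b", whichever the model judges correct
    """
    # Step 1: sample k_inf candidate responses.
    candidates = [generate() for _ in range(k_inf)]

    # Step 2: score each candidate by averaging k_verif verification votes.
    scores = [sum(verify(s) for _ in range(k_verif)) / k_verif
              for s in candidates]

    # Step 3: keep candidates within `margin` of the best verification score.
    best = max(scores)
    finalists = [s for s, sc in zip(candidates, scores) if sc >= best - margin]
    finalists = list(dict.fromkeys(finalists))  # deduplicate, preserve order
    if len(finalists) == 1:
        return finalists[0]

    # Step 4: break remaining ties with round-robin pairwise comparisons.
    wins = Counter()
    for i, a in enumerate(finalists):
        for b in finalists[i + 1:]:
            for _ in range(k_tie):
                wins[a if compare(a, b) == "a" else b] += 1
    return max(finalists, key=lambda s: wins[s])

# Toy demo: a noisy "model" that answers "7" (correct) 60% of the time,
# with 90%-accurate verification; the search should recover "7".
random.seed(0)
generate = lambda: "7" if random.random() < 0.6 else "9"
verify = lambda s: int((s == "7") == (random.random() < 0.9))
compare = lambda a, b: (("a" if a == "7" else "b")
                        if random.random() < 0.9
                        else ("b" if a == "7" else "a"))
answer = sampling_based_search(generate, verify, compare,
                               k_inf=20, k_verif=10, k_tie=5)
print(answer)
```

The toy demo illustrates why verification scales: even a weak generator combined with a moderately reliable verifier selects the correct answer once enough candidates are sampled and scored.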
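For reference, the defaults reported in the Experiment Setup row can be collected in a small config mapping. Interpreting σinf and σverif as sampling temperatures is an assumption for this sketch; the quoted text gives only the symbols and values.

```python
# Reported defaults from the paper's experiment setup; reading sigma_inf and
# sigma_verif as sampling temperatures is an assumption, not stated in the row.
DEFAULTS = {
    "k_inf": 200,              # candidate responses sampled per question
    "sigma_inf": 1.5,          # temperature when sampling candidates (assumed)
    "k_verif": 50,             # verification samples per candidate
    "sigma_verif": 1.0,        # temperature when sampling verifications (assumed)
    "max_output_tokens": 8192, # cap on output tokens per query
}
print(DEFAULTS)
```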