Scaling Test-Time Compute Without Verification or RL is Suboptimal

Authors: Amrith Setlur, Nived Rajaraman, Sergey Levine, Aviral Kumar

ICML 2025

Reproducibility Variable Result LLM Response
Research Type Experimental We corroborate our theory empirically on didactic and math reasoning problems with 3/8/32B-sized pretrained LLMs, where we find verification is crucial for scaling test-time compute. ... Empirical results corroborating theory.
Researcher Affiliation Academia ¹Carnegie Mellon University, ²UC Berkeley. Correspondence to: Amrith Setlur <EMAIL>.
Pseudocode Yes Algorithm 1: Simple Verifier-Based Algorithm; Algorithm 2: Simple Verifier-Based Algorithm with ℓ0/1 loss
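The pseudocode itself is not reproduced in this report. As an illustration only, a generic verifier-based (VB) test-time strategy of the kind the paper studies can be sketched as best-of-N selection with a learned verifier; `generate`, `verifier_score`, and the toy stand-ins below are hypothetical, not the authors' implementation:

```python
import random

def best_of_n(generate, verifier_score, prompt, n=8):
    """Sample n candidate responses and return the one the verifier scores highest."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=verifier_score)

def toy_generate(prompt):
    # Stand-in generator: appends a random "answer" to the prompt.
    return f"{prompt}-answer-{random.randint(0, 100)}"

def toy_score(response):
    # Stand-in verifier: scores a response by its trailing number.
    return int(response.rsplit("-", 1)[-1])

random.seed(0)
best = toy_score(best_of_n(toy_generate, toy_score, "q", n=4))
```

With more test-time compute (larger n), selection quality can only improve under a reliable verifier, which is the mechanism the paper contrasts with verifier-free scaling.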
Open Source Code No The paper does not provide an explicit statement about releasing source code for the methodology described, nor does it include a link to a code repository.
Open Datasets Yes We corroborate our theoretical results on math reasoning with 3B/8B Llama models, and the S1 (Muennighoff et al., 2025) model. For the S1 model that is trained in a verifier-free manner, we show that a simple verifier-based approach performs better than S1 across a set of test-time compute budgets (Figure 6). For the Llama models, we explicitly control the heterogeneity of the base LLM and show that VF methods perform poorly with more heterogeneous base LLMs, and that the gap between VB and VF performance scales with more test-time compute (Figure 5 in Section 7). Our investigation also reveals that common pre-trained LLMs are indeed heterogeneous and satisfy anti-concentration, which are abstractions we introduce to prove our theoretical results (Figures 7, 8). To the best of our knowledge, this is the first theoretical result and systematic study showing a separation between VF and VB methods, under realistic assumptions on the base model.
Dataset Splits Yes We run all our training on the questions in the training set of MATH (Hendrycks et al., 2021), and run our test on the MATH500 evaluation benchmark.
Hardware Specification No All experiments in this work were run at Carnegie Mellon University. ... The authors thank the TRC program at Google Cloud and Lambda labs for providing compute resources that supported this work.
Software Dependencies No The paper mentions models like GPT2-xl (Radford et al., 2019) and Llama-3.1/3.2 (Dubey et al., 2024), and optimizers like Adam, but does not provide specific version numbers for software libraries or environments (e.g., Python, PyTorch, TensorFlow, CUDA versions).
Experiment Setup Yes For the RL runs, we use REINFORCE (Ahmadian et al., 2024) and train for 20k iterations with a batch size of 64 and a constant learning rate of 1e-4, with the Adam optimizer. ... For SFT, we also use the Adam optimizer with a learning rate of 2e-4 and a batch size of 64. Similar to RL, we apply a KL regularization term in addition to the next-token prediction loss (ignoring the padding token 0), where the strength of the KL term is the same as for RL. SFT runs are also initialized with the base policy. ... We use a batch size of 32 and learning rate of 1e-6 for all our experiments. We run SFT and verifier training for 10000 iterations on each instance. We use a weight decay of 0.01 for training both.
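The quoted setup can be collected into a config sketch for reference. The field names and dict structure are this report's own, not the authors' code; the second quoted block of hyperparameters appears to describe a separate SFT-plus-verifier training setup and is grouped accordingly:

```python
# Hyperparameters exactly as quoted in the paper's experiment setup.
RL_CONFIG = {
    "algorithm": "REINFORCE",   # Ahmadian et al., 2024
    "iterations": 20_000,
    "batch_size": 64,
    "learning_rate": 1e-4,      # constant schedule
    "optimizer": "Adam",
    "kl_regularization": True,
}

SFT_CONFIG = {
    "optimizer": "Adam",
    "learning_rate": 2e-4,
    "batch_size": 64,
    "kl_regularization": True,  # same KL strength as the RL runs
    "init": "base policy",
}

# Separate setup quoted for SFT and verifier training runs.
VERIFIER_TRAINING_CONFIG = {
    "batch_size": 32,
    "learning_rate": 1e-6,
    "iterations": 10_000,       # per instance
    "weight_decay": 0.01,
}
```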