Scaling Test-Time Compute Without Verification or RL is Suboptimal
Authors: Amrith Setlur, Nived Rajaraman, Sergey Levine, Aviral Kumar
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We corroborate our theory empirically on didactic and math reasoning problems with 3/8/32B-sized pretrained LLMs, where we find verification is crucial for scaling test-time compute. ... Empirical results corroborating theory. |
| Researcher Affiliation | Academia | 1Carnegie Mellon University 2UC Berkeley. Correspondence to: Amrith Setlur <EMAIL>. |
| Pseudocode | Yes | Algorithm 1: Simple Verifier-Based Algorithm; Algorithm 2: Simple Verifier-Based Algorithm with ℓ0/1 loss |
| Open Source Code | No | The paper does not provide an explicit statement about releasing source code for the methodology described, nor does it include a link to a code repository. |
| Open Datasets | Yes | We corroborate our theoretical results on math reasoning with 3B/8B Llama models, and the S1 (Muennighoff et al., 2025) model. For the S1 model that is trained in a verifier-free manner, we show that a simple verifier-based approach performs better than S1 across a set of test-time compute budgets (Figure 6). For the Llama models, we explicitly control the heterogeneity of the base LLM and show that VF methods perform poorly with more heterogeneous base LLMs, and that the gap between VB and VF performance scales with more test-time compute (Figure 5 in Section 7). Our investigation also reveals that common pre-trained LLMs are indeed heterogeneous and satisfy anti-concentration, which are abstractions we introduce to prove our theoretical results (Figures 7, 8). To the best of our knowledge, this is the first theoretical result and systematic study showing a separation between VF and VB methods, under realistic assumptions on the base model. |
| Dataset Splits | Yes | We run all our training on the questions in the training set of MATH (Hendrycks et al., 2021), and run our test on the MATH500 evaluation benchmark. |
| Hardware Specification | No | All experiments in this work were run at Carnegie Mellon University. ... The authors thank the TRC program at Google Cloud and Lambda labs for providing compute resources that supported this work. |
| Software Dependencies | No | The paper mentions models like GPT2-xl (Radford et al., 2019) and Llama-3.1/3.2 (Dubey et al., 2024), and optimizers like Adam, but does not provide specific version numbers for software libraries or environments (e.g., Python, PyTorch, TensorFlow, CUDA versions). |
| Experiment Setup | Yes | For the RL runs, we use REINFORCE (Ahmadian et al., 2024) and train for 20k iterations in both settings with a batch size of 64 and a constant learning rate of 1e-4, with the Adam optimizer. ... For SFT, we also use the Adam optimizer with a learning rate of 2e-4 and a batch size of 64. Similar to RL, we apply a KL regularization term in addition to the next-token prediction loss (ignoring the padding token 0), where the strength of the KL term is the same as for RL. SFT runs are also initialized with the base policy. ... We use a batch size of 32 and a learning rate of 1e-6 for all our experiments. We run SFT and verifier training for 10000 iterations on each instance. We use a weight decay of 0.01 for training both. |
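The hyperparameters reported in the Experiment Setup row can be collected into a small configuration sketch. This is an illustrative reconstruction, not the authors' code: the KL-regularization strength (`kl_coeff`) is a hypothetical placeholder, since the paper says SFT and RL share the same KL strength but this excerpt does not state its value.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SFTConfig:
    """SFT hyperparameters as reported in the paper's setup section."""
    optimizer: str = "adam"
    learning_rate: float = 2e-4
    batch_size: int = 64
    weight_decay: float = 0.01
    pad_token_id: int = 0     # padding token ignored in the loss
    kl_coeff: float = 0.1     # hypothetical value, not given in this excerpt


def total_loss(nll: float, kl_to_base: float, cfg: SFTConfig) -> float:
    """Next-token prediction loss plus KL regularization toward the base policy,
    matching the description 'a KL regularization term in addition to the
    next token prediction loss'."""
    return nll + cfg.kl_coeff * kl_to_base
```

The frozen dataclass keeps the reported values in one place; swapping in the RL configuration (learning rate 1e-4, 20k iterations) would follow the same pattern.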