Scaling Test-Time Compute Without Verification or RL is Suboptimal

Authors: Amrith Setlur, Nived Rajaraman, Sergey Levine, Aviral Kumar

ICML 2025

Reproducibility Variable Result LLM Response
Research Type Experimental We corroborate our theory empirically on didactic and math reasoning problems with 3/8/32B-sized pretrained LLMs, where we find verification is crucial for scaling test-time compute. ... Empirical results corroborating theory.
Researcher Affiliation Academia ¹Carnegie Mellon University, ²UC Berkeley. Correspondence to: Amrith Setlur <EMAIL>.
Pseudocode Yes Algorithm 1: Simple Verifier-Based Algorithm; Algorithm 2: Simple Verifier-Based Algorithm with ℓ0/1 loss
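The pseudocode itself is not reproduced in this report. As an illustration only, a generic verifier-based (VB) test-time strategy of the kind the paper studies can be sketched as best-of-N selection with a learned verifier; `generate`, `verifier_score`, and the toy stand-ins below are hypothetical, not the authors' implementation:

```python
import random

def best_of_n(generate, verifier_score, prompt, n=8):
    """Sample n candidate responses and return the one the verifier scores highest."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=verifier_score)

def toy_generate(prompt):
    # Stand-in generator: appends a random "answer" to the prompt.
    return f"{prompt}-answer-{random.randint(0, 100)}"

def toy_score(response):
    # Stand-in verifier: scores a response by its trailing number.
    return int(response.rsplit("-", 1)[-1])

random.seed(0)
best = toy_score(best_of_n(toy_generate, toy_score, "q", n=4))
```

With more test-time compute (larger n), selection quality can only improve under a reliable verifier, which is the mechanism the paper contrasts with verifier-free scaling.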
Open Source Code No The paper does not provide an explicit statement about releasing source code for the methodology described, nor does it include a link to a code repository.
Open Datasets Yes We corroborate our theoretical results on math reasoning with 3B/8B Llama models, and the S1 (Muennighoff et al., 2025) model. For the S1 model that is trained in a verifier-free manner, we show that a simple verifier-based approach performs better than S1 across a set of test-time compute budgets (Figure 6). For the Llama models, we explicitly control the heterogeneity of the base LLM and show that VF methods perform poorly with more heterogeneous base LLMs, and that the gap between VB and VF performance scales with more test-time compute (Figure 5 in Section 7). Our investigation also reveals that common pre-trained LLMs are indeed heterogeneous and satisfy anti-concentration, which are abstractions we introduce to prove our theoretical results (Figures 7, 8). To the best of our knowledge, this is the first theoretical result and systematic study showing a separation between VF and VB methods, under realistic assumptions on the base model.
Dataset Splits Yes We run all our training on the questions in the training set of MATH (Hendrycks et al., 2021), and run our test on the MATH500 evaluation benchmark.
Hardware Specification No All experiments in this work were run at Carnegie Mellon University. ... The authors thank the TRC program at Google Cloud and Lambda labs for providing compute resources that supported this work.
Software Dependencies No The paper mentions models like GPT2-xl (Radford et al., 2019) and Llama-3.1/3.2 (Dubey et al., 2024), and optimizers like Adam, but does not provide specific version numbers for software libraries or environments (e.g., Python, PyTorch, TensorFlow, CUDA versions).
Experiment Setup Yes For the RL runs, we use REINFORCE (Ahmadian et al., 2024) and train for 20k iterations with a batch size of 64 and a constant learning rate of 1e-4, with the Adam optimizer. ... For SFT, we also use the Adam optimizer with a learning rate of 2e-4 and a batch size of 64. Similar to RL, we apply a KL regularization term in addition to the next-token prediction loss (ignoring the padding token 0), where the strength of the KL term is the same as for RL. SFT runs are also initialized with the base policy. ... We use a batch size of 32 and learning rate of 1e-6 for all our experiments. We run SFT and verifier training for 10000 iterations on each instance. We use a weight decay of 0.01 for training both.
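The quoted setup can be collected into a config sketch for reference. The field names and dict structure are this report's own, not the authors' code; the second quoted block of hyperparameters appears to describe a separate SFT-plus-verifier training setup and is grouped accordingly:

```python
# Hyperparameters exactly as quoted in the paper's experiment setup.
RL_CONFIG = {
    "algorithm": "REINFORCE",   # Ahmadian et al., 2024
    "iterations": 20_000,
    "batch_size": 64,
    "learning_rate": 1e-4,      # constant schedule
    "optimizer": "Adam",
    "kl_regularization": True,
}

SFT_CONFIG = {
    "optimizer": "Adam",
    "learning_rate": 2e-4,
    "batch_size": 64,
    "kl_regularization": True,  # same KL strength as the RL runs
    "init": "base policy",
}

# Separate setup quoted for SFT and verifier training runs.
VERIFIER_TRAINING_CONFIG = {
    "batch_size": 32,
    "learning_rate": 1e-6,
    "iterations": 10_000,       # per instance
    "weight_decay": 0.01,
}
```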