Rewarding Progress: Scaling Automated Process Verifiers for LLM Reasoning

Authors: Amrith Setlur, Chirag Nagpal, Adam Fisch, Xinyang Geng, Jacob Eisenstein, Rishabh Agarwal, Alekh Agarwal, Jonathan Berant, Aviral Kumar

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental We empirically validate our claims by training process advantage verifiers (PAVs) to measure progress under such provers and show that compared to ORMs, they are >8% more accurate and 1.5-5× more compute-efficient. Equipped with these insights, our PAVs enable one of the first results showing a 6× gain in sample efficiency for a policy trained using online RL with PRMs vs. ORMs.
Researcher Affiliation Collaboration Amrith Setlur1, Chirag Nagpal2, Adam Fisch3, Xinyang Geng3, Jacob Eisenstein3, Rishabh Agarwal3, Alekh Agarwal2, Jonathan Berant3, Aviral Kumar1,3 (1CMU, 2Google Research, 3Google DeepMind). Equal contribution, equal advising. Please send correspondence to: EMAIL.
Pseudocode No The paper describes methods and procedures in paragraph form and mathematical equations, but it does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code No The pretrained model checkpoints for Gemma-2B, 9B, and 27B used in this work are available publicly at https://huggingface.co/google, and the MATH dataset we use from Hendrycks et al. (2021) is also public here: https://github.com/hendrycks/math.
Open Datasets Yes The pretrained model checkpoints for Gemma-2B, 9B, and 27B used in this work are available publicly at https://huggingface.co/google, and the MATH dataset we use from Hendrycks et al. (2021) is also public here: https://github.com/hendrycks/math.
Dataset Splits Yes We compute a 95% confidence interval over the true mean of the test accuracy, at each iterate of the RL training in Figure 7, Figure 15, and for each value of N in Figure 8. This mean is computed over the 500 examples in the MATH500 test set.
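The confidence interval quoted above can be reproduced with a standard normal-approximation interval over per-example correctness scores. This is a sketch under assumptions: the paper does not state which interval construction it uses, and the data below is hypothetical.

```python
import math

def mean_confidence_interval(outcomes, z=1.96):
    """Normal-approximation CI for the mean of per-example scores.

    `outcomes` is a list of 0/1 correctness indicators, e.g. one per
    MATH500 test example. z=1.96 gives an approximate 95% interval.
    """
    n = len(outcomes)
    mean = sum(outcomes) / n
    # Sample variance with Bessel's correction.
    var = sum((x - mean) ** 2 for x in outcomes) / (n - 1)
    half_width = z * math.sqrt(var / n)
    return mean - half_width, mean + half_width

# Hypothetical run: 300 of 500 test examples answered correctly.
lo, hi = mean_confidence_interval([1] * 300 + [0] * 200)
```

With 500 examples the interval is centered on the observed accuracy (here 0.6) with a half-width of roughly ±0.04, which matches the scale of error bars typically reported on MATH500.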
Hardware Specification No The paper does not explicitly describe the specific hardware (e.g., GPU/CPU models, memory) used for running its experiments.
Software Dependencies No The paper mentions the use of an "Adam optimizer" and the "MADE architecture" but does not specify version numbers for any software dependencies or libraries.
Experiment Setup Yes We finetune each of these on the MATH (Hendrycks et al., 2021) dataset. The finetuning is done for 5000 iterations, with a batch size of 32 and a maximum learning rate of 5e-6 for the 2B and 9B policies and 5e-7 for the 27B policy. We used the Adam optimizer with a linear warm-up and cosine decay learning rate schedule; the linear warm-up is applied for the first 500 iterations only.
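A minimal sketch of the learning-rate schedule described in this row: a linear warm-up to the peak rate over the first 500 iterations, followed by cosine decay over the remaining iterations. The function name and the decay-to-zero endpoint are assumptions; the paper specifies only the schedule shape, peak rates, and warm-up length.

```python
import math

def lr_at_step(step, max_lr=5e-6, warmup_steps=500, total_steps=5000):
    """Learning rate at a given training step (0-indexed)."""
    if step < warmup_steps:
        # Linear ramp from 0 up to max_lr over the warm-up phase.
        return max_lr * (step + 1) / warmup_steps
    # Cosine decay from max_lr toward 0 over the remaining steps.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return max_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

For the 27B policy one would pass `max_lr=5e-7`; the rate peaks at step 499 and decays smoothly to near zero by step 4999.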