Rewarding Progress: Scaling Automated Process Verifiers for LLM Reasoning

Authors: Amrith Setlur, Chirag Nagpal, Adam Fisch, Xinyang Geng, Jacob Eisenstein, Rishabh Agarwal, Alekh Agarwal, Jonathan Berant, Aviral Kumar

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental We empirically validate our claims by training process advantage verifiers (PAVs) to measure progress under such provers and show that compared to ORMs, they are >8% more accurate and 1.5-5× more compute-efficient. Equipped with these insights, our PAVs enable one of the first results showing a 6× gain in sample efficiency for a policy trained using online RL with PRMs vs. ORMs.
Researcher Affiliation Collaboration Amrith Setlur1, Chirag Nagpal2, Adam Fisch3, Xinyang Geng3, Jacob Eisenstein3, Rishabh Agarwal3, Alekh Agarwal2, Jonathan Berant3, Aviral Kumar1,3 (1CMU, 2Google Research, 3Google DeepMind). Equal contribution, equal advising. Please send correspondence to: EMAIL.
Pseudocode No The paper describes methods and procedures in paragraph form and mathematical equations, but it does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code No The pretrained model checkpoints for Gemma-2B, 9B, and 27B used in this work are available publicly at https://huggingface.co/google, and the MATH dataset we use from Hendrycks et al. (2021) is also public here: https://github.com/hendrycks/math.
Open Datasets Yes The pretrained model checkpoints for Gemma-2B, 9B, and 27B used in this work are available publicly at https://huggingface.co/google, and the MATH dataset we use from Hendrycks et al. (2021) is also public here: https://github.com/hendrycks/math.
Dataset Splits Yes We compute a 95% confidence interval over the true mean of the test accuracy, at each iterate of the RL training in Figure 7, Figure 15, and for each value of N in Figure 8. This mean is computed over the 500 examples in the MATH500 test set.
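The confidence interval quoted above can be reproduced with a standard normal-approximation interval over per-example correctness scores. This is a sketch under assumptions: the paper does not state which interval construction it uses, and the data below is hypothetical.

```python
import math

def mean_confidence_interval(outcomes, z=1.96):
    """Normal-approximation CI for the mean of per-example scores.

    `outcomes` is a list of 0/1 correctness indicators, e.g. one per
    MATH500 test example. z=1.96 gives an approximate 95% interval.
    """
    n = len(outcomes)
    mean = sum(outcomes) / n
    # Sample variance with Bessel's correction.
    var = sum((x - mean) ** 2 for x in outcomes) / (n - 1)
    half_width = z * math.sqrt(var / n)
    return mean - half_width, mean + half_width

# Hypothetical run: 300 of 500 test examples answered correctly.
lo, hi = mean_confidence_interval([1] * 300 + [0] * 200)
```

With 500 examples the interval is centered on the observed accuracy (here 0.6) with a half-width of roughly ±0.04, which matches the scale of error bars typically reported on MATH500.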
Hardware Specification No The paper does not explicitly describe the specific hardware (e.g., GPU/CPU models, memory) used for running its experiments.
Software Dependencies No The paper mentions the use of an "Adam optimizer" and the "MADE architecture" but does not specify version numbers for any software dependencies or libraries.
Experiment Setup Yes We finetune each of these on the MATH (Hendrycks et al., 2021) dataset. The finetuning is done for 5000 iterations, with a batch size of 32 and a maximum learning rate of 5e-6 for the 2B and 9B policies and 5e-7 for the 27B policy. We used the Adam optimizer with a linear warm-up and cosine decay learning rate schedule; the linear warm-up is applied for the first 500 iterations only.
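A minimal sketch of the learning-rate schedule described in this row: a linear warm-up to the peak rate over the first 500 iterations, followed by cosine decay over the remaining iterations. The function name and the decay-to-zero endpoint are assumptions; the paper specifies only the schedule shape, peak rates, and warm-up length.

```python
import math

def lr_at_step(step, max_lr=5e-6, warmup_steps=500, total_steps=5000):
    """Learning rate at a given training step (0-indexed)."""
    if step < warmup_steps:
        # Linear ramp from 0 up to max_lr over the warm-up phase.
        return max_lr * (step + 1) / warmup_steps
    # Cosine decay from max_lr toward 0 over the remaining steps.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return max_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

For the 27B policy one would pass `max_lr=5e-7`; the rate peaks at step 499 and decays smoothly to near zero by step 4999.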