Generative Verifiers: Reward Modeling as Next-Token Prediction

Authors: Lunjun Zhang, Arian Hosseini, Hritik Bansal, Seyed Mehran Kazemi, Aviral Kumar, Rishabh Agarwal

ICLR 2025

Reproducibility

Variable | Result | LLM Response
Research Type | Experimental | We demonstrate that GenRM outperforms discriminative verifiers, DPO verifiers, and LLM-as-a-Judge, resulting in large performance gains with Best-of-N, namely 5% → 45.3% on algorithmic tasks, 73% → 93.4% on GSM8K, and 28% → 44.6% on easy-to-hard generalization on MATH. Furthermore, we find that training GenRM with synthetic verification rationales is sufficient to pick out subtle errors on math problems. Finally, we demonstrate that generative verifiers scale favorably with model size and inference-time compute.
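The Best-of-N result above can be sketched in a few lines: a generative verifier scores each candidate solution (in GenRM, by the probability assigned to a "Yes" token), and the highest-scoring candidate is returned. This is a minimal illustration, not the paper's implementation; `verifier_score` and the hard-coded scores are hypothetical stand-ins.

```python
# Minimal sketch of Best-of-N selection with a generative verifier.
# verifier_score(problem, candidate) is assumed to return a scalar score,
# e.g. p("Yes" | problem, candidate) from a GenRM-style verifier.

def best_of_n(problem, candidates, verifier_score):
    """Return the candidate with the highest verifier score."""
    scored = [(verifier_score(problem, c), c) for c in candidates]
    return max(scored, key=lambda sc: sc[0])[1]

# Toy usage: hard-coded scores stand in for real verifier probabilities.
fake_scores = {"7": 0.91, "12": 0.34, "9": 0.58}
pick = best_of_n("2+5=?", list(fake_scores), lambda p, c: fake_scores[c])
print(pick)  # "7", the highest-scoring candidate
```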
Researcher Affiliation | Collaboration | 1 Google DeepMind, 2 University of Toronto, 3 Mila, 4 UCLA, 5 CMU. EMAIL, EMAIL
Pseudocode | No | The paper only describes methods and procedures in narrative text and mathematical equations; there are no clearly labeled pseudocode or algorithm blocks with structured steps.
Open Source Code | No | To ensure our work can be easily reproduced, we use open-weights Gemma models (Gemma Team et al., 2024a;b), and describe our experiment setup thoroughly in Section 4, with additional details about data collection and processing in Appendix A and hyperparameters in Appendix B. Since GenRM relies on next-token prediction, no additional code is needed beyond supervised fine-tuning. We have also open-sourced our training dataset of synthetic rationales at https://github.com/genrm-star/genrm-critiques.
Open Datasets | Yes | Algorithmic reasoning: We use two difficult string manipulation tasks, namely Last Letter Concatenation (Wei et al., 2022) and Word Sorting from Big-Bench (Suzgun et al., 2022). Math reasoning: We train grade-school math verifiers on the GSM8K dataset from Cobbe et al. (2021)... on the much harder MATH dataset (Hendrycks et al., 2021). We have also open-sourced our training dataset of synthetic rationales at https://github.com/genrm-star/genrm-critiques.
Dataset Splits | Yes | Grade School Math (Cobbe et al., 2021): We follow the original train/test split and use 1.3K problems for test, 128 problems for validation, and about 7.2K problems for training. We generate 50 solutions per problem, and randomly sample at most 16 correct solutions and 16 incorrect solutions per problem as the training set.
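The per-problem sampling rule quoted above (at most 16 correct and 16 incorrect solutions out of 50 generated) can be sketched as follows. This is an illustrative reconstruction under the stated caps; the function and label names are assumptions, not the paper's code.

```python
import random

MAX_PER_LABEL = 16  # cap quoted in the paper: at most 16 correct and 16 incorrect

def build_verifier_examples(solutions, rng):
    """solutions: list of (solution_text, is_correct) pairs for one problem.

    Returns (solution, label) pairs with at most MAX_PER_LABEL per label,
    labeling correct solutions "Yes" and incorrect ones "No"."""
    correct = [s for s, ok in solutions if ok]
    incorrect = [s for s, ok in solutions if not ok]
    picked_correct = rng.sample(correct, min(MAX_PER_LABEL, len(correct)))
    picked_incorrect = rng.sample(incorrect, min(MAX_PER_LABEL, len(incorrect)))
    return ([(s, "Yes") for s in picked_correct]
            + [(s, "No") for s in picked_incorrect])

# Toy usage: 50 generated solutions, 25 correct and 25 incorrect.
rng = random.Random(0)
sols = [(f"sol{i}", i % 2 == 0) for i in range(50)]
examples = build_verifier_examples(sols, rng)
print(len(examples))  # 32 = 16 correct + 16 incorrect
```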
Hardware Specification | No | For training verifiers, we use open-weights Gemma models (Gemma Team et al., 2024a;b), specifically Gemma 2B for algorithmic tasks, and Gemma 2B, 7B, and Gemma-2 9B for GSM8K. For solution generation as well as LLM-as-a-Judge, we use Gemma 2B for algorithmic tasks and Gemini 1.0 Pro (Google et al., 2023) for GSM8K.
Software Dependencies | No | We use the Adam optimizer (Kingma, 2014) with decoupled weight decay (Loshchilov and Hutter, 2017) and a gradient norm clipping of 1.0. We use a linear warmup of 1000 gradient steps, and a cosine decay schedule that decays to 10% of the peak learning rate after a decay period. We finetune for 300K steps with a batch size of 64 and a cosine decay period of 200K, and use the seqio (Roberts et al., 2022) library to create data mixtures.
Experiment Setup | Yes | GenRM verifiers: After doing a sweep of learning rates (LR), we find that an LR of [2e-6, 1e-6, 5e-7] works well for the tasks considered (with LR=2e-6 generally being the best). We use a weight decay of 1e-2, and do not apply any dropout. We use the Adam optimizer (Kingma, 2014) with decoupled weight decay (Loshchilov and Hutter, 2017) and a gradient norm clipping of 1.0. We use a linear warmup of 1000 gradient steps, and a cosine decay schedule that decays to 10% of the peak learning rate after a decay period. We finetune for 300K steps with a batch size of 64 and a cosine decay period of 200K, and use the seqio (Roberts et al., 2022) library to create data mixtures.