Generative Verifiers: Reward Modeling as Next-Token Prediction

Authors: Lunjun Zhang, Arian Hosseini, Hritik Bansal, Seyed Mehran Kazemi, Aviral Kumar, Rishabh Agarwal

ICLR 2025

Reproducibility

Variable | Result | LLM Response
Research Type | Experimental | We demonstrate that GenRM outperforms discriminative verifiers, DPO verifiers, and LLM-as-a-Judge, resulting in large performance gains with Best-of-N, namely 5% → 45.3% on algorithmic tasks, 73% → 93.4% on GSM8K, and 28% → 44.6% on easy-to-hard generalization on MATH. Furthermore, we find that training GenRM with synthetic verification rationales is sufficient to pick out subtle errors on math problems. Finally, we demonstrate that generative verifiers scale favorably with model size and inference-time compute.
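The Best-of-N result above can be sketched in a few lines: a generative verifier scores each candidate solution (in GenRM, by the probability assigned to a "Yes" token), and the highest-scoring candidate is returned. This is a minimal illustration, not the paper's implementation; `verifier_score` and the hard-coded scores are hypothetical stand-ins.

```python
# Minimal sketch of Best-of-N selection with a generative verifier.
# verifier_score(problem, candidate) is assumed to return a scalar score,
# e.g. p("Yes" | problem, candidate) from a GenRM-style verifier.

def best_of_n(problem, candidates, verifier_score):
    """Return the candidate with the highest verifier score."""
    scored = [(verifier_score(problem, c), c) for c in candidates]
    return max(scored, key=lambda sc: sc[0])[1]

# Toy usage: hard-coded scores stand in for real verifier probabilities.
fake_scores = {"7": 0.91, "12": 0.34, "9": 0.58}
pick = best_of_n("2+5=?", list(fake_scores), lambda p, c: fake_scores[c])
print(pick)  # "7", the highest-scoring candidate
```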
Researcher Affiliation | Collaboration | 1 Google DeepMind, 2 University of Toronto, 3 Mila, 4 UCLA, 5 CMU. EMAIL, EMAIL
Pseudocode | No | The paper only describes methods and procedures in narrative text and mathematical equations; there are no clearly labeled pseudocode or algorithm blocks with structured steps.
Open Source Code | No | To ensure our work can be easily reproduced, we use open-weights Gemma models (Gemma Team et al., 2024a;b), and describe our experiment setup thoroughly in Section 4, with additional details about data collection and processing in Appendix A and hyperparameters in Appendix B. Since GenRM relies on next-token prediction, no additional code is needed beyond supervised fine-tuning. We have also open-sourced our training dataset of synthetic rationales at https://github.com/genrm-star/genrm-critiques.
Open Datasets | Yes | Algorithmic reasoning: We use two difficult string manipulation tasks, namely Last Letter Concatenation (Wei et al., 2022) and Word Sorting from Big-Bench (Suzgun et al., 2022). Math reasoning: We train grade-school math verifiers on the GSM8K dataset from Cobbe et al. (2021)... on the much harder MATH dataset (Hendrycks et al., 2021). We have also open-sourced our training dataset of synthetic rationales at https://github.com/genrm-star/genrm-critiques.
Dataset Splits | Yes | Grade School Math (Cobbe et al., 2021): We follow the original train/test split and use 1.3K problems for test, 128 problems for validation, and about 7.2K problems for training. We generate 50 solutions per problem, and randomly sample at most 16 correct solutions and 16 incorrect solutions per problem as the training set.
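The per-problem sampling rule quoted above (at most 16 correct and 16 incorrect solutions out of 50 generated) can be sketched as follows. This is an illustrative reconstruction under the stated caps; the function and label names are assumptions, not the paper's code.

```python
import random

MAX_PER_LABEL = 16  # cap quoted in the paper: at most 16 correct and 16 incorrect

def build_verifier_examples(solutions, rng):
    """solutions: list of (solution_text, is_correct) pairs for one problem.

    Returns (solution, label) pairs with at most MAX_PER_LABEL per label,
    labeling correct solutions "Yes" and incorrect ones "No"."""
    correct = [s for s, ok in solutions if ok]
    incorrect = [s for s, ok in solutions if not ok]
    picked_correct = rng.sample(correct, min(MAX_PER_LABEL, len(correct)))
    picked_incorrect = rng.sample(incorrect, min(MAX_PER_LABEL, len(incorrect)))
    return ([(s, "Yes") for s in picked_correct]
            + [(s, "No") for s in picked_incorrect])

# Toy usage: 50 generated solutions, 25 correct and 25 incorrect.
rng = random.Random(0)
sols = [(f"sol{i}", i % 2 == 0) for i in range(50)]
examples = build_verifier_examples(sols, rng)
print(len(examples))  # 32 = 16 correct + 16 incorrect
```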
Hardware Specification | No | For training verifiers, we use open-weights Gemma models (Gemma Team et al., 2024a;b), specifically Gemma 2B for algorithmic tasks, and Gemma 2B, 7B, and Gemma-2 9B for GSM8K. For solution generation as well as LLM-as-a-Judge, we use Gemma 2B for algorithmic tasks and Gemini 1.0 Pro (Google et al., 2023) for GSM8K.
Software Dependencies | No | We use the Adam optimizer (Kingma, 2014) with decoupled weight decay (Loshchilov and Hutter, 2017) and a gradient norm clipping of 1.0. We use a linear warmup of 1000 gradient steps, and a cosine decay schedule that decays to 10% of the peak learning rate after a decay period. We finetune for 300K steps with a batch size of 64 and a cosine decay period of 200K, and use the seqio (Roberts et al., 2022) library to create data mixtures.
Experiment Setup | Yes | GenRM verifiers: After doing a sweep of learning rates (LR), we find that an LR of [2e-6, 1e-6, 5e-7] works well for the tasks considered (with LR=2e-6 generally being the best). We use a weight decay of 1e-2, and do not apply any dropout. We use the Adam optimizer (Kingma, 2014) with decoupled weight decay (Loshchilov and Hutter, 2017) and a gradient norm clipping of 1.0. We use a linear warmup of 1000 gradient steps, and a cosine decay schedule that decays to 10% of the peak learning rate after a decay period. We finetune for 300K steps with a batch size of 64 and a cosine decay period of 200K, and use the seqio (Roberts et al., 2022) library to create data mixtures.