RRM: Robust Reward Model Training Mitigates Reward Hacking

Authors: Tianqi Liu, Wei Xiong, Jie Ren, Lichang Chen, Junru Wu, Rishabh Joshi, Yang Gao, Jiaming Shen, Zhen Qin, Tianhe Yu, Daniel Sohn, Anastasia Makarova, Jeremiah Zhe Liu, Yuan Liu, Bilal Piot, Abe Ittycheriah, Aviral Kumar, Mohammad Saleh

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments show that our approach successfully filters out undesirable artifacts, yielding a more robust reward model (RRM). Our RRM improves the performance of a pairwise reward model trained on Gemma-2-9b-it on Reward-Bench, increasing accuracy from 80.61% to 84.15%. Additionally, we train two DPO policies using both the RM and the RRM, demonstrating that the RRM significantly enhances DPO-aligned policies, improving MT-Bench scores from 7.27 to 8.31 and length-controlled win rates on AlpacaEval-2 from 33.46% to 52.49%.
Researcher Affiliation | Collaboration | Tianqi Liu1, Wei Xiong2, Jie Ren1, Lichang Chen3, Junru Wu1, Rishabh Joshi1, Yang Gao1, Jiaming Shen1, Zhen Qin1, Tianhe Yu1, Daniel Sohn1, Anastasia Makarova1, Jeremiah Zhe Liu1, Yuan Liu1, Bilal Piot1, Abe Ittycheriah1, Aviral Kumar1, Mohammad Saleh1. Google DeepMind1, University of Illinois Urbana-Champaign2, University of Maryland, College Park3.
Pseudocode | Yes | Algorithm 1: Example Python Code for Data Augmentation
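The paper's Algorithm 1 is not reproduced in this report, but the core augmentation idea it describes — decoupling reward signals from prompt-independent artifacts such as length or style — can be sketched as follows. This is a minimal illustration, not the authors' code; the function name `augment_preferences` and the `prompt`/`chosen`/`rejected` record layout are assumptions for the sketch. The key move is pairing each prompt's contextual (chosen) response against a response sampled from a *different* prompt, which is labeled rejected regardless of its surface features.

```python
import random

def augment_preferences(examples, seed=0):
    """Sketch of artifact-decoupling augmentation (hypothetical helper).

    For each preference pair, keep the original comparison and add a new
    pair whose rejected side is a response drawn from another prompt: a
    contextual response should beat a non-contextual one, so the reward
    model cannot rely on prompt-independent artifacts alone.
    """
    rng = random.Random(seed)
    augmented = []
    for i, ex in enumerate(examples):
        # Keep the original in-context comparison.
        augmented.append(ex)
        # Sample an index j != i for an off-prompt (non-contextual) response.
        j = rng.randrange(len(examples) - 1)
        if j >= i:
            j += 1
        augmented.append({
            "prompt": ex["prompt"],
            "chosen": ex["chosen"],                # contextual response
            "rejected": examples[j]["chosen"],     # non-contextual response
        })
    return augmented
```

Applied to a dataset of n pairs, this yields 2n pairs: the originals plus one artifact-decoupled pair per prompt.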
Open Source Code | No | All code used for training the reward models (RM and RRM) and for running the experiments described in this paper will be made publicly available upon publication.
Open Datasets | Yes | We study RRM using the preference dataset curated by RLHFlow (Dong et al., 2024), which has been used to train a series of strong open-source preference models as evaluated by Reward-Bench (Lambert et al., 2024). The dataset consists of 700K preference pairs, a mixture of HH-RLHF (Bai et al., 2022a), SHP (Ethayarajh et al., 2022), HelpSteer (Wang et al., 2023), PKU-SafeRLHF (Ji et al., 2024), UltraFeedback (Cui et al., 2023), UltraInteract (Yuan et al., 2024), Distilabel-Capybara (Daniele & Suphavadeeprasit, 2023), and Distilabel-Orca (Lian et al., 2023). We list the data sources and number of examples in Table 1.
Dataset Splits | Yes | Reward Model Accuracy: the test accuracies on Reward-Bench are reported in Table 2. RRM improves Chat Hard and Safety by a clear margin but sacrifices some Reasoning accuracy. For Reasoning, we hypothesize that math and coding are less affected by non-contextual artifacts, and that rewards other than an LLM judge could be used there, since those tasks are objective and have golden answers. On average, RRM improves over the RM by an absolute 3.54% accuracy gain.
Hardware Specification | No | We use Flash-Attention to accelerate training and apply DeepSpeed ZeRO Stage 3 to fit a batch size of 16 on each GPU (global batch size 128).
Software Dependencies | No | We train the reward models for 1 epoch using the AdamW (Loshchilov, 2017) optimizer with learning rate 1e-6 and batch size 128.
Experiment Setup | Yes | We train the reward models for 1 epoch using the AdamW (Loshchilov, 2017) optimizer with learning rate 1e-6 and batch size 128. ... We train the policies for at most 2 epochs using the AdamW (Loshchilov, 2017) optimizer with learning rate 2e-7 and a global batch size of 128, where the batch size follows Dong et al. (2024) and the learning rate was chosen by grid search.
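The pairwise reward models described in the setup above are conventionally trained with a Bradley-Terry objective: maximize the log-probability that the chosen response's scalar reward exceeds the rejected one's. The paper does not spell out the loss in this report's excerpts, so the following is a standard sketch of that per-pair loss under the assumption of scalar rewards, not the authors' exact implementation.

```python
import math

def pairwise_bt_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry pairwise loss: -log sigmoid(r_chosen - r_rejected).

    The loss is log(2) when the two rewards tie, and shrinks toward zero
    as the margin between chosen and rejected rewards grows.
    """
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

def batch_loss(pairs) -> float:
    """Mean pairwise loss over a batch of (r_chosen, r_rejected) tuples."""
    return sum(pairwise_bt_loss(c, r) for c, r in pairs) / len(pairs)
```

In the reported setup this loss would be minimized with AdamW (learning rate 1e-6, batch size 128) over the augmented preference pairs; the optimizer details come from the quoted experiment setup, while the loss form is the assumed standard.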