Robust Preference Optimization through Reward Model Distillation
Authors: Adam Fisch, Jacob Eisenstein, Vicky Zayats, Alekh Agarwal, Ahmad Beirami, Chirag Nagpal, Peter Shaw, Jonathan Berant
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present the results of our experiment in Figure 2. As can be seen in the plot, the more challenging setting is when ρ < 0.5, which corresponds to a sample of preference annotations in which shorter outputs are generally preferred. This distribution shift is more difficult because as mentioned the oracle reward model (trained on human annotations) has a bias in favor of longer outputs (Singhal et al., 2023). Nevertheless we get sizable improvements compared to the reference policy πSFT for all length bias values. |
| Researcher Affiliation | Industry | Adam Fisch (fisch@google.com, Google DeepMind); Jacob Eisenstein (EMAIL, Google DeepMind); Vicky Zayats (EMAIL, Google DeepMind); Alekh Agarwal (EMAIL, Google Research); Ahmad Beirami (EMAIL, Google DeepMind); Chirag Nagpal (EMAIL, Google Research); Peter Shaw (EMAIL, Google DeepMind); Jonathan Berant (EMAIL, Google DeepMind) |
| Pseudocode | No | The paper includes mathematical formulations and theoretical propositions, but it does not contain any explicitly labeled 'Pseudocode' or 'Algorithm' blocks, nor does it present structured steps in a code-like format. |
| Open Source Code | No | The paper does not contain an explicit statement about releasing code or provide any links to a code repository. |
| Open Datasets | Yes | We first train an oracle reward model on the TL;DR summarization task (Stiennon et al., 2020; Völske et al., 2017) and relabel all preference pairs with this oracle. This enables us to use the oracle reward model for evaluation, without worrying about the gap to true human preferences. After relabeling, longer responses (where longer is defined as y1 having at least 10% more tokens than y2) are preferred in 61% of the examples. To test the effect of a spurious correlation on preference-based policy optimization, we select a training set of 30K examples from the relabeled data such that the longer output is preferred in ρ fraction of examples, with ρ ∈ {0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8}. Each such training set is denoted Dρ. At each Dρ, we compare our approach to DPO (Rafailov et al., 2023) and IPO (Azar et al., 2024), which are currently the most commonly used offline alignment methods. We test the following variants of distillation and pessimism: ... Additionally, we show results for an unbiased setting on TL;DR, as well as for an unbiased setting on Anthropic Helpfulness (Bai et al., 2022). |
| Dataset Splits | Yes | To test the effect of a spurious correlation on preference-based policy optimization, we select a training set of 30K examples from the relabeled data such that the longer output is preferred in ρ fraction of examples, with ρ ∈ {0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8}. Each such training set is denoted Dρ. At each Dρ, we compare our approach to DPO (Rafailov et al., 2023) and IPO (Azar et al., 2024), which are currently the most commonly used offline alignment methods. ... We evaluate performance by sampling summaries for test set prompts, evaluating the average reward according to the oracle reward model, and computing the advantage in average reward compared to πSFT (before alignment). We train policies for 10,000 steps with batch size 16 and learning rate 10⁻⁶, and reward models for 3k steps with batch size 64 and learning rate 4 × 10⁻⁶. We use the validation set for model selection during policy training and to choose the following hyperparameters. |
| Hardware Specification | Yes | Appendix C, Compute resources: We train policies on 32 TPU v3 chips and reward models on 16 TPU v3 chips. We obtain roughly 0.1 steps per second when training, for both the policy and reward models. |
| Software Dependencies | No | The paper mentions models like Palm-2-XS (Anil et al., 2023) and Gemini 1.0 Ultra (Gemini Team, 2024), but does not specify any software libraries or tools with version numbers that would be required for replication. |
| Experiment Setup | Yes | We train policies for 10,000 steps with batch size 16 and learning rate 10⁻⁶, and reward models for 3k steps with batch size 64 and learning rate 4 × 10⁻⁶. We use the validation set for model selection during policy training and to choose the following hyperparameters. For all DPO variants, we sweep over β ∈ {.01, .1, 1, 3, 10, 30, 100}. For IPO, we sweep over τ ∈ {0.01, 0.1, 1, 3, 5, 10, 25}. For all pessimistic methods we anneal γ = α/β from 10⁻⁴ to 10⁻² linearly during the 10k training steps (however, in later experiments performed with e-DPO, we found annealing does not affect performance and a constant γ also leads to similar performance, see Figure B.5). |
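
Since the paper releases no code, the following Python sketch only illustrates two of the steps quoted in the table: selecting a length-biased training set Dρ and annealing γ linearly over training. It is not the authors' implementation. The function names, the `preferred_len` and `rejected_len` fields, the choice to skip pairs where neither response is at least 10% longer, and the use of exact per-class counts are all assumptions made for illustration.

```python
import random


def longer_is_preferred(preferred_len, rejected_len, margin=0.10):
    """Classify a pair by length, using the quoted 10% token threshold.

    Returns True if the preferred response is at least `margin` longer,
    False if the rejected response is at least `margin` longer, and None
    otherwise (assumption: such pairs are left out of the biased subset).
    """
    if preferred_len >= (1 + margin) * rejected_len:
        return True
    if rejected_len >= (1 + margin) * preferred_len:
        return False
    return None


def build_biased_subset(pairs, rho, n, seed=0):
    """Sample n pairs so that the longer response is preferred in a rho fraction.

    `pairs` is a list of dicts with hypothetical keys `preferred_len` and
    `rejected_len` (token counts after oracle relabeling).
    """
    rng = random.Random(seed)
    longer, shorter = [], []
    for pair in pairs:
        label = longer_is_preferred(pair["preferred_len"], pair["rejected_len"])
        if label is True:
            longer.append(pair)
        elif label is False:
            shorter.append(pair)
    n_longer = round(rho * n)
    n_shorter = n - n_longer
    if n_longer > len(longer) or n_shorter > len(shorter):
        raise ValueError("not enough pairs to reach the requested length bias")
    subset = rng.sample(longer, n_longer) + rng.sample(shorter, n_shorter)
    rng.shuffle(subset)
    return subset


def gamma_schedule(step, total_steps=10_000, start=1e-4, end=1e-2):
    """Linearly interpolate gamma from `start` to `end` over `total_steps`."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + frac * (end - start)
```

With the quoted settings, this would be called as `build_biased_subset(pairs, rho=0.2, n=30_000)` for the most shorter-biased training set, and `gamma_schedule(step)` would be evaluated at each of the 10k policy training steps.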