reproducibilityindex.ai

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Robust Preference Optimization through Reward Model Distillation

Authors: Adam Fisch, Jacob Eisenstein, Vicky Zayats, Alekh Agarwal, Ahmad Beirami, Chirag Nagpal, Peter Shaw, Jonathan Berant

TMLR 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We present the results of our experiment in Figure 2. As can be seen in the plot, the more challenging setting is when ρ < 0.5, which corresponds to a sample of preference annotations in which shorter outputs are generally preferred. This distribution shift is more difficult because, as mentioned, the oracle reward model (trained on human annotations) has a bias in favor of longer outputs (Singhal et al., 2023). Nevertheless, we get sizable improvements compared to the reference policy πSFT for all length bias values.
Researcher Affiliation	Industry	Adam Fisch (fisch@google.com), Google DeepMind; Jacob Eisenstein, Google DeepMind; Vicky Zayats, Google DeepMind; Alekh Agarwal, Google Research; Ahmad Beirami, Google DeepMind; Chirag Nagpal, Google Research; Peter Shaw, Google DeepMind; Jonathan Berant, Google DeepMind
Pseudocode	No	The paper includes mathematical formulations and theoretical propositions, but it does not contain any explicitly labeled 'Pseudocode' or 'Algorithm' blocks, nor does it present structured steps in a code-like format.
Open Source Code	No	The paper does not contain an explicit statement about releasing code or provide any links to a code repository.
Open Datasets	Yes	We first train an oracle reward model on the TL;DR summarization task (Stiennon et al., 2020; Völske et al., 2017) and relabel all preference pairs with this oracle. This enables us to use the oracle reward model for evaluation, without worrying about the gap to true human preferences. After relabeling, longer responses (where longer is defined as y1 having at least 10% more tokens than y2) are preferred in 61% of the examples. To test the effect of a spurious correlation on preference-based policy optimization, we select a training set of 30K examples from the relabeled data such that the longer output is preferred in ρ fraction of examples, with ρ ∈ {0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8}. Each such training set is denoted Dρ. At each Dρ, we compare our approach to DPO (Rafailov et al., 2023) and IPO (Azar et al., 2024), which are currently the most commonly used offline alignment methods. We test the following variants of distillation and pessimism: ... Additionally, we show results for an unbiased setting on TL;DR, as well as for an unbiased setting on Anthropic Helpfulness (Bai et al., 2022).
Dataset Splits	Yes	To test the effect of a spurious correlation on preference-based policy optimization, we select a training set of 30K examples from the relabeled data such that the longer output is preferred in ρ fraction of examples, with ρ ∈ {0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8}. Each such training set is denoted Dρ. At each Dρ, we compare our approach to DPO (Rafailov et al., 2023) and IPO (Azar et al., 2024), which are currently the most commonly used offline alignment methods. ... We evaluate performance by sampling summaries for test set prompts, evaluating the average reward according to the oracle reward model, and computing the advantage in average reward compared to πSFT (before alignment). We train policies for 10,000 steps with batch size 16 and learning rate 10^-6, and reward models for 3k steps with batch size 64 and learning rate 4 × 10^-6. We use the validation set for model selection during policy training and to choose the following hyperparameters.
Hardware Specification	Yes	Appendix C, Compute resources: We train policies on 32 TPU v3 chips and reward models on 16 TPU v3 chips. We obtain roughly 0.1 steps per second when training, for both the policy and reward models.
Software Dependencies	No	The paper mentions models like PaLM-2-XS (Anil et al., 2023) and Gemini 1.0 Ultra (Gemini Team, 2024), but does not specify any software libraries or tools with version numbers that would be required for replication.
Experiment Setup	Yes	We train policies for 10,000 steps with batch size 16 and learning rate 10^-6, and reward models for 3k steps with batch size 64 and learning rate 4 × 10^-6. We use the validation set for model selection during policy training and to choose the following hyperparameters. For all DPO variants, we sweep over β ∈ {.01, .1, 1, 3, 10, 30, 100}. For IPO, we sweep over τ ∈ {0.01, 0.1, 1, 3, 5, 10, 25}. For all pessimistic methods we anneal γ = α/β from 10^-4 to 10^-2 linearly during the 10k training steps (however, in later experiments performed with e-DPO, we found annealing does not affect performance and a constant γ also leads to similar performance, see Figure B.5).
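
The paper is reported above as releasing no code, so the following is only a minimal sketch of two setup steps quoted in the table: building a length-biased training set Dρ and linearly annealing γ over training. It is not the authors' implementation. The function names, the dictionary fields (len1, len2, preferred), the choice to skip pairs where neither response is at least 10% longer, and the per-class random sampling are all assumptions made for illustration. The annealing endpoints are read here as 10^-4 and 10^-2.

```python
import random


def longer_side(len1, len2, margin_pct=10):
    """Return 1 or 2 for the response with at least margin_pct% more tokens, else 0.

    Integer arithmetic avoids floating-point edge cases at the exact 10% boundary.
    """
    if 100 * len1 >= (100 + margin_pct) * len2:
        return 1
    if 100 * len2 >= (100 + margin_pct) * len1:
        return 2
    return 0


def select_biased_subset(pairs, rho, n, seed=0):
    """Sample n pairs so that the longer response is preferred in a rho fraction.

    Each pair is a dict with hypothetical keys: len1, len2 (token counts)
    and preferred (1 or 2, the label assigned by the oracle reward model).
    """
    rng = random.Random(seed)
    longer_pref, shorter_pref = [], []
    for pair in pairs:
        side = longer_side(pair["len1"], pair["len2"])
        if side == 0:
            continue  # assumption: drop pairs with no clearly longer response
        bucket = longer_pref if pair["preferred"] == side else shorter_pref
        bucket.append(pair)

    n_long = round(rho * n)
    n_short = n - n_long
    if n_long > len(longer_pref) or n_short > len(shorter_pref):
        raise ValueError("not enough pairs to reach the requested rho")

    subset = rng.sample(longer_pref, n_long) + rng.sample(shorter_pref, n_short)
    rng.shuffle(subset)
    return subset


def gamma_schedule(step, total_steps=10_000, start=1e-4, end=1e-2):
    """Linearly interpolate gamma from start to end over total_steps."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + frac * (end - start)
```

Under these assumptions, select_biased_subset(pairs, rho=0.2, n=30_000) would produce one Dρ, and gamma_schedule(step) would be evaluated at every policy update. The quoted setup notes that a constant γ gave similar performance for e-DPO, so the schedule could be replaced by a fixed value.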