Reward-Augmented Data Enhances Direct Preference Alignment of LLMs

Authors: Shenao Zhang, Zhihan Liu, Boyi Liu, Yufeng Zhang, Yingxiang Yang, Yongfei Liu, Liyu Chen, Tao Sun, Zhaoran Wang

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The experiments across various benchmarks and diverse models demonstrate that our approach consistently boosts DPO by a considerable margin. Through comprehensive ablation studies, we show that our method not only maximizes the utility of preference data but also mitigates the issue of unlearning, demonstrating its broad effectiveness beyond mere data expansion.
Researcher Affiliation | Collaboration | 1Northwestern University, 2ByteDance.
Pseudocode | Yes | Algorithm 1: Theoretical Version of the Reward-Augmented DPO
Open Source Code | Yes | Our code is available at https://github.com/shenao-zhang/reward-augmented-preference.
Open Datasets | Yes | We adopt the UltraFeedback (Cui et al., 2023) preference dataset for this experiment. Specifically, UltraFeedback contains reward values scored by GPT-4 (LLM-as-Judge), which range from 1 to 10 for each of the preference pairs... AlpacaEval 2.0 (Dubois et al., 2024), MT-Bench (Zheng et al., 2024), and Arena-Hard-Auto (Li et al., 2024b)... GSM8K (Cobbe et al., 2021), GPQA (Rein et al., 2023), MuSR (Sprague et al., 2023), TruthfulQA (Lin et al., 2021), BBH (Suzgun et al., 2022), and ARC-Challenge (Clark et al., 2018).
Dataset Splits | No | The paper mentions using the UltraFeedback dataset and performing evaluations on a "test set" and a "subset of UltraFeedback", but does not provide explicit details on the training, validation, and test splits (e.g., percentages, counts, or specific methodology) for the experiments.
Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for running the experiments.
Software Dependencies | No | The paper mentions the use of the AdamW optimizer and various LLMs (e.g., Mistral-7B-Instruct-v0.3, Qwen2-7B-Instruct), but does not specify version numbers for any key software components or libraries.
Experiment Setup | Yes | For hyperparameters, we tune the KL regularization coefficient β within [0.001, 0.01, 0.1] and the batch size within [64, 128, 256]. We find that β = 0.01 and a batch size of 256 yield the overall best performance for DPO across models. Our method uses the same hyperparameters as DPO. Besides, we adopt the AdamW optimizer (Loshchilov, 2017), with a learning rate of 5e-7 and a warmup ratio of 0.1. Furthermore, we observe that for models such as Qwen2-7B-Instruct and Gemma-2-9B-It on UltraFeedback, as well as Llama-3-8B-Instruct on on-policy data, both DPO and our proposed method yield improved performance when employing the conservative DPO (cDPO) technique (Mitchell, 2023). Consequently, for these models, we set the label smoothing hyperparameter from the Alignment Handbook (Tunstall et al., 2023a) to 0.3, while keeping it at 0 for the remaining models.
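The conservative DPO (cDPO) technique mentioned in the experiment setup is standard DPO with label smoothing on the preference labels: with smoothing ε, the chosen response is treated as preferred with probability 1 − ε. A minimal pure-Python sketch of the per-pair loss under this assumption (illustrative only, not the authors' implementation; `label_smoothing=0` recovers vanilla DPO, while `0.3` matches the cDPO setting reported above):

```python
import math


def sigmoid(x: float) -> float:
    """Numerically plain logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def cdpo_loss(logratio_chosen: float,
              logratio_rejected: float,
              beta: float = 0.01,
              label_smoothing: float = 0.3) -> float:
    """Conservative DPO loss for a single preference pair.

    Inputs are log pi_theta(y|x) - log pi_ref(y|x) for the chosen and
    rejected responses. The label-smoothed loss mixes the standard DPO
    objective with its label-flipped counterpart:
        -(1 - eps) * log sigmoid(beta * margin) - eps * log sigmoid(-beta * margin)
    """
    margin = beta * (logratio_chosen - logratio_rejected)
    return (-(1.0 - label_smoothing) * math.log(sigmoid(margin))
            - label_smoothing * math.log(sigmoid(-margin)))
```

With a zero margin the loss is log 2 regardless of smoothing, and with `label_smoothing=0.5` the loss is symmetric in the two responses, which is why ε is kept well below 0.5 in practice.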
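The UltraFeedback reward scores noted in the Open Datasets row (GPT-4 judge values from 1 to 10 per response) are commonly binarized into (chosen, rejected) preference pairs by taking the highest- and lowest-scored completions for each prompt. A hypothetical sketch of that preprocessing step (the function and field names are illustrative, not taken from the paper's code, and this is the standard binarization rather than the paper's reward-augmentation itself):

```python
def binarize(scored_responses):
    """Turn a list of (text, reward) completions for one prompt into a
    single preference pair: highest reward is chosen, lowest is rejected."""
    ranked = sorted(scored_responses, key=lambda r: r[1], reverse=True)
    chosen, rejected = ranked[0], ranked[-1]
    return {
        "chosen": chosen[0],
        "rejected": rejected[0],
        "chosen_score": chosen[1],    # e.g. a GPT-4 judge score in [1, 10]
        "rejected_score": rejected[1],
    }
```

Keeping the raw scores alongside the pair, as sketched here, is what makes reward-conditioned relabeling schemes possible downstream, since the margin between `chosen_score` and `rejected_score` is not discarded.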