Reward-Augmented Data Enhances Direct Preference Alignment of LLMs

Authors: Shenao Zhang, Zhihan Liu, Boyi Liu, Yufeng Zhang, Yingxiang Yang, Yongfei Liu, Liyu Chen, Tao Sun, Zhaoran Wang

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The experiments across various benchmarks and diverse models demonstrate that our approach consistently boosts DPO by a considerable margin. Through comprehensive ablation studies, we show that our method not only maximizes the utility of preference data but also mitigates the issue of unlearning, demonstrating its broad effectiveness beyond mere data expansion.
Researcher Affiliation | Collaboration | 1Northwestern University, 2ByteDance.
Pseudocode | Yes | Algorithm 1: Theoretical Version of the Reward-Augmented DPO
Open Source Code | Yes | Our code is available at https://github.com/shenao-zhang/reward-augmented-preference.
Open Datasets | Yes | We adopt the UltraFeedback (Cui et al., 2023) preference dataset for this experiment. Specifically, UltraFeedback contains reward values scored by GPT-4 (LLM-as-Judge), which range from 1 to 10 for each of the preference pairs... AlpacaEval 2.0 (Dubois et al., 2024), MT-Bench (Zheng et al., 2024), and Arena-Hard-Auto (Li et al., 2024b)... GSM8K (Cobbe et al., 2021), GPQA (Rein et al., 2023), MuSR (Sprague et al., 2023), TruthfulQA (Lin et al., 2021), BBH (Suzgun et al., 2022), and ARC-Challenge (Clark et al., 2018).
Dataset Splits | No | The paper mentions using the UltraFeedback dataset and performing evaluations on a "test set" and a "subset of UltraFeedback", but does not provide explicit details on the training, validation, and test splits (e.g., percentages, counts, or specific methodology) for the experiments.
Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for running the experiments.
Software Dependencies | No | The paper mentions the use of the AdamW optimizer and various LLMs (e.g., Mistral-7B-Instruct-v0.3, Qwen2-7B-Instruct), but does not specify version numbers for any key software components or libraries.
Experiment Setup | Yes | For hyperparameters, we tune the KL regularization coefficient β within [0.001, 0.01, 0.1] and the batch size within [64, 128, 256]. We find that β = 0.01 and a batch size of 256 yield the overall best performance for DPO across models. Our method uses the same hyperparameters as DPO. Besides, we adopt the AdamW optimizer (Loshchilov, 2017), with a learning rate of 5e-7 and a warmup ratio of 0.1. Furthermore, we observe that for models such as Qwen2-7B-Instruct and Gemma-2-9B-It on UltraFeedback, as well as Llama-3-8B-Instruct on on-policy data, both DPO and our proposed method yield improved performance when employing the conservative DPO (cDPO) technique (Mitchell, 2023). Consequently, for these models, we set the label smoothing hyperparameter from the Alignment Handbook (Tunstall et al., 2023a) to 0.3, while keeping it at 0 for the remaining models.
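The conservative DPO (cDPO) technique mentioned in the experiment setup is standard DPO with label smoothing on the preference labels: with smoothing ε, the chosen response is treated as preferred with probability 1 − ε. A minimal pure-Python sketch of the per-pair loss under this assumption (illustrative only, not the authors' implementation; `label_smoothing=0` recovers vanilla DPO, while `0.3` matches the cDPO setting reported above):

```python
import math


def sigmoid(x: float) -> float:
    """Numerically plain logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def cdpo_loss(logratio_chosen: float,
              logratio_rejected: float,
              beta: float = 0.01,
              label_smoothing: float = 0.3) -> float:
    """Conservative DPO loss for a single preference pair.

    Inputs are log pi_theta(y|x) - log pi_ref(y|x) for the chosen and
    rejected responses. The label-smoothed loss mixes the standard DPO
    objective with its label-flipped counterpart:
        -(1 - eps) * log sigmoid(beta * margin) - eps * log sigmoid(-beta * margin)
    """
    margin = beta * (logratio_chosen - logratio_rejected)
    return (-(1.0 - label_smoothing) * math.log(sigmoid(margin))
            - label_smoothing * math.log(sigmoid(-margin)))
```

With a zero margin the loss is log 2 regardless of smoothing, and with `label_smoothing=0.5` the loss is symmetric in the two responses, which is why ε is kept well below 0.5 in practice.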
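The UltraFeedback reward scores noted in the Open Datasets row (GPT-4 judge values from 1 to 10 per response) are commonly binarized into (chosen, rejected) preference pairs by taking the highest- and lowest-scored completions for each prompt. A hypothetical sketch of that preprocessing step (the function and field names are illustrative, not taken from the paper's code, and this is the standard binarization rather than the paper's reward-augmentation itself):

```python
def binarize(scored_responses):
    """Turn a list of (text, reward) completions for one prompt into a
    single preference pair: highest reward is chosen, lowest is rejected."""
    ranked = sorted(scored_responses, key=lambda r: r[1], reverse=True)
    chosen, rejected = ranked[0], ranked[-1]
    return {
        "chosen": chosen[0],
        "rejected": rejected[0],
        "chosen_score": chosen[1],    # e.g. a GPT-4 judge score in [1, 10]
        "rejected_score": rejected[1],
    }
```

Keeping the raw scores alongside the pair, as sketched here, is what makes reward-conditioned relabeling schemes possible downstream, since the margin between `chosen_score` and `rejected_score` is not discarded.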