Towards Robust Alignment of Language Models: Distributionally Robustifying Direct Preference Optimization
Authors: Junkang Wu, Yuexiang Xie, Zhengyi Yang, Jiancan Wu, Jiawei Chen, Jinyang Gao, Bolin Ding, Xiang Wang, Xiangnan He
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical evaluations demonstrate that Dr. DPO substantially improves the quality of generated text and response accuracy in preference datasets, showcasing enhanced performance in both noisy and noise-free settings. |
| Researcher Affiliation | Collaboration | 1. University of Science and Technology of China; 2. Alibaba Group; 3. Zhejiang University; 4. MoE Key Lab of BIPC, University of Science and Technology of China |
| Pseudocode | Yes | Figure 9: Pseudocode for our proposed Dr. DPO, as well as the original DPO objective. |
| Open Source Code | Yes | The code is available at https://github.com/junkangwu/Dr_DPO. |
| Open Datasets | Yes | We conduct experiments on two datasets: IMDB (Maas et al., 2011) and Anthropic HH (Bai et al., 2022). |
| Dataset Splits | No | The paper mentions training and test sets and introduces noise into the training data, for example, 'To test the model's resilience to noise, we introduced random inversions between selected and rejected responses in the training data at varying noise levels, specifically with probabilities of 10%, 20%, 30%, and 40%.' and 'The Win-Rate computation is specifically designed for the single-turn dialogue portion of the HH dataset's test subset.' However, it does not explicitly provide specific percentages, sample counts, or citations to predefined train/test/validation splits for the datasets used. |
| Hardware Specification | Yes | We carried out all computational tasks on a suite of four 80GB A100 GPUs. |
| Software Dependencies | No | The paper mentions using the 'Pythia 2.8B model', 'GPT-2-large', and 'SiEBERT', but does not provide specific version numbers for the underlying software libraries or dependencies such as PyTorch, TensorFlow, or Python. |
| Experiment Setup | Yes | Our training regimen was in line with the DPO-established protocol (Rafailov et al., 2023a). We built upon the Pythia 2.8B model, as described in (Biderman et al., 2023), to develop our Supervised Fine-Tuning (SFT) model. The SFT model was fine-tuned on the Anthropic HH dataset over the course of one epoch, employing a batch size of 64 and a learning rate of 5 × 10⁻⁷. In addition, we further refined the model using the Anthropic HH dataset and the DPO loss function (or other baseline approaches) through an additional epoch of fine-tuning. To test the model's resilience to noise, we introduced random inversions between selected and rejected responses in the training data at varying noise levels, specifically with probabilities of 10%, 20%, 30%, and 40%. Throughout these experiments, we consistently set the β parameter to 0.1 and adopted the Kullback-Leibler (KL) divergence as the metric for ϕ-divergence. |
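The noise-injection procedure described in the experiment setup (randomly inverting chosen and rejected responses with a fixed probability) can be sketched as follows. This is a minimal illustration; the function and variable names are assumptions, not taken from the paper's released code at https://github.com/junkangwu/Dr_DPO.

```python
import random

def inject_pairwise_noise(pairs, flip_prob, seed=0):
    """Randomly swap (chosen, rejected) response pairs with probability flip_prob.

    `pairs` is a list of (chosen, rejected) tuples. A seeded RNG keeps the
    corruption reproducible across runs. Names here are illustrative only.
    """
    rng = random.Random(seed)
    noisy = []
    for chosen, rejected in pairs:
        if rng.random() < flip_prob:
            noisy.append((rejected, chosen))  # preference label flipped
        else:
            noisy.append((chosen, rejected))  # preference label kept
    return noisy

# Example: corrupt 20% of preference pairs, matching one of the paper's
# tested noise levels (10%, 20%, 30%, 40%).
clean = [("good reply", "bad reply"), ("helpful", "harmful")]
noisy = inject_pairwise_noise(clean, flip_prob=0.2)
```

At flip probability 0 the dataset is unchanged, and at probability 1 every pair is inverted, which bounds the two extremes of the robustness sweep.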