AlphaDPO: Adaptive Reward Margin for Direct Preference Optimization
Authors: Junkang Wu, Xue Wang, Zhengyi Yang, Jiancan Wu, Jinyang Gao, Bolin Ding, Xiang Wang, Xiangnan He
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, AlphaDPO achieves state-of-the-art performance on AlpacaEval 2 (58.7% LC win rate) and Arena-Hard (35.7% win rate) across Mistral2-7B, Llama3-8B, and Gemma2-9B, demonstrating robust alignment without multi-stage training. |
| Researcher Affiliation | Collaboration | ¹MoE Key Lab of BIPC, University of Science and Technology of China; ²Alibaba Group. Correspondence to: Xiang Wang <EMAIL>, Xiangnan He <EMAIL>. |
| Pseudocode | No | The paper describes methods in prose and mathematical formulations within the "Proposed Method: AlphaDPO" section, but does not contain a dedicated pseudocode block or algorithm listing. |
| Open Source Code | Yes | The code is available at https://github.com/junkangwu/alpha-DPO. |
| Open Datasets | Yes | For a fair comparison, we use the same training data as SimPO: princeton-nlp/llama3-ultrafeedback-armorm, princeton-nlp/mistral-instruct-ultrafeedback, and princeton-nlp/gemma2-ultrafeedback-armorm for Llama3-8B, Mistral2-7B, and Gemma2-9B, respectively. Additionally, the v0.2 Llama3-Instruct setup uses RLHFlow/ArmoRM-Llama3-8B-v0.1 (Wang et al., 2024b). |
| Dataset Splits | No | The paper states, 'For a fair comparison, we use the same training data as SimPO: princeton-nlp/llama3-ultrafeedback-armorm, princeton-nlp/mistral-instruct-ultrafeedback, and princeton-nlp/gemma2-ultrafeedback-armorm for Llama3-8B, Mistral2-7B, and Gemma2-9B, respectively.' However, it does not explicitly provide training, validation, or test splits for these datasets. |
| Hardware Specification | Yes | All training experiments presented in this paper were conducted using 8 A100 GPUs, as per the procedures detailed in the alignment-handbook repository. |
| Software Dependencies | No | The paper mentions 'Adam was used as the optimizer (Kingma & Ba, 2014)' but does not provide specific version numbers for any key software libraries, frameworks, or programming languages used in the implementation. |
| Experiment Setup | Yes | For other parameters, we used a consistent batch size of 128 across all methods. The learning rate was searched within the range of [3e-7, 5e-7, 8e-7, 1e-6], and all models were trained for a single epoch with a cosine learning rate schedule and a 10% warmup phase. Adam was used as the optimizer (Kingma & Ba, 2014). Additionally, the maximum sequence length was set to 2048. Table 4 outlines the hyperparameters used for AlphaDPO under various settings. |
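The experiment-setup row above can be sketched as a plain configuration dict. This is a minimal illustration, not code from the paper's repository: the key names (`batch_size`, `lr_grid`, and so on) and the `configs_for_sweep` helper are hypothetical, chosen only to collect the reported hyperparameters in one runnable place.

```python
# Hedged sketch of the shared training setup reported in the paper's
# experiment-setup description. All key names here are illustrative
# assumptions, not identifiers from the alpha-DPO codebase.
TRAIN_CONFIG = {
    "batch_size": 128,                    # consistent across all methods
    "lr_grid": [3e-7, 5e-7, 8e-7, 1e-6],  # learning rates searched
    "num_epochs": 1,                      # single training epoch
    "lr_scheduler": "cosine",             # cosine learning-rate schedule
    "warmup_ratio": 0.10,                 # 10% warmup phase
    "optimizer": "adam",                  # Adam (Kingma & Ba, 2014)
    "max_seq_len": 2048,                  # maximum sequence length
}

def configs_for_sweep(base=TRAIN_CONFIG):
    """Expand the learning-rate grid into one concrete config per run."""
    for lr in base["lr_grid"]:
        cfg = {k: v for k, v in base.items() if k != "lr_grid"}
        cfg["learning_rate"] = lr
        yield cfg

if __name__ == "__main__":
    for cfg in configs_for_sweep():
        print(cfg["learning_rate"], cfg["batch_size"])
```

Each yielded config fixes one learning rate from the searched grid while keeping the other settings identical, mirroring how the paper describes sweeping only the learning rate.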