AlphaDPO: Adaptive Reward Margin for Direct Preference Optimization

Authors: Junkang Wu, Xue Wang, Zhengyi Yang, Jiancan Wu, Jinyang Gao, Bolin Ding, Xiang Wang, Xiangnan He

ICML 2025

Reproducibility checklist (each entry lists the variable, the assessed result, and the supporting LLM response):
Research Type: Experimental. Empirically, AlphaDPO achieves state-of-the-art performance on AlpacaEval 2 (58.7% LC win rate) and Arena-Hard (35.7% win rate) across Mistral2-7B, Llama3-8B, and Gemma2-9B, demonstrating robust alignment without multi-stage training.
Researcher Affiliation: Collaboration. 1 MoE Key Lab of BIPC, University of Science and Technology of China; 2 Alibaba Group. Correspondence to: Xiang Wang <EMAIL>, Xiangnan He <EMAIL>.
Pseudocode: No. The paper describes its method in prose and mathematical formulations within the "Proposed Method: AlphaDPO" section, but contains no dedicated pseudocode block or algorithm listing.
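Since the paper provides no pseudocode, a minimal sketch of a DPO-style preference loss with an adaptive, reference-scaled margin may help orient readers. This is an illustrative assumption, not AlphaDPO's exact formulation: the margin form and the roles of `alpha` and `beta` here are simplified stand-ins (see the paper for the actual objective).

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def adaptive_margin_loss(logp_w: float, logp_l: float,
                         ref_logp_w: float, ref_logp_l: float,
                         beta: float = 0.1, alpha: float = 0.5) -> float:
    """DPO-style preference loss with an adaptive margin (illustrative sketch).

    The margin scales with the reference model's own preference gap,
    interpolating between a SimPO-like margin-free objective (alpha = 0)
    and a DPO-like reference-anchored objective (alpha = 1). This exact
    form is an assumption for illustration.
    """
    policy_gap = logp_w - logp_l        # policy log-prob gap: chosen vs. rejected
    ref_gap = ref_logp_w - ref_logp_l   # reference model's gap on the same pair
    margin = alpha * ref_gap            # adaptive margin term (hypothetical form)
    return -math.log(sigmoid(beta * (policy_gap - margin)))
```

A larger policy gap relative to the margin yields a smaller loss, which is the qualitative behavior any margin-based preference objective should exhibit.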
Open Source Code: Yes. The code is available at https://github.com/junkangwu/alpha-DPO.
Open Datasets: Yes. For a fair comparison, we use the same training data as SimPO: princeton-nlp/llama3-ultrafeedback-armorm, princeton-nlp/mistral-instruct-ultrafeedback, and princeton-nlp/gemma2-ultrafeedback-armorm for Llama3-8B, Mistral2-7B, and Gemma2-9B, respectively. Additionally, the v0.2 Llama3-Instruct setup uses RLHFlow/ArmoRM-Llama3-8B-v0.1 (Wang et al., 2024b).
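The model-to-dataset pairing above can be captured in a small lookup. The dataset IDs are as cited in the paper; the `dataset_for` helper is a hypothetical convenience, and actually loading these sets would additionally require the Hugging Face `datasets` library.

```python
# Base model -> Hugging Face preference-dataset ID, as listed in the paper
# (the same training data used by SimPO).
TRAINING_DATA = {
    "Llama3-8B": "princeton-nlp/llama3-ultrafeedback-armorm",
    "Mistral2-7B": "princeton-nlp/mistral-instruct-ultrafeedback",
    "Gemma2-9B": "princeton-nlp/gemma2-ultrafeedback-armorm",
}

def dataset_for(model_name: str) -> str:
    """Return the dataset ID paired with a given base model."""
    try:
        return TRAINING_DATA[model_name]
    except KeyError:
        raise ValueError(f"no preference dataset listed for {model_name!r}")
```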
Dataset Splits: No. The paper states, 'For a fair comparison, we use the same training data as SimPO: princeton-nlp/llama3-ultrafeedback-armorm, princeton-nlp/mistral-instruct-ultrafeedback, and princeton-nlp/gemma2-ultrafeedback-armorm for Llama3-8B, Mistral2-7B, and Gemma2-9B, respectively.' However, it does not specify training, validation, or test splits for these datasets.
Hardware Specification: Yes. All training experiments presented in the paper were conducted on 8 A100 GPUs, following the procedures detailed in the alignment-handbook repository.
Software Dependencies: No. The paper mentions 'Adam was used as the optimizer (Kingma & Ba, 2014)' but provides no version numbers for the key software libraries, frameworks, or programming languages used in the implementation.
Experiment Setup: Yes. For other parameters, a consistent batch size of 128 was used across all methods. The learning rate was searched within [3e-7, 5e-7, 8e-7, 1e-6], and all models were trained for a single epoch with a cosine learning rate schedule and a 10% warmup phase. Adam was used as the optimizer (Kingma & Ba, 2014), and the maximum sequence length was set to 2048. Table 4 outlines the hyperparameters used for AlphaDPO under various settings.
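The reported schedule (linear warmup over the first 10% of steps, then cosine decay, with the peak learning rate searched in [3e-7, 1e-6]) can be sketched as below; the exact implementation in the alignment-handbook-based setup may differ.

```python
import math

def lr_at_step(step: int, total_steps: int,
               peak_lr: float = 5e-7, warmup_frac: float = 0.10) -> float:
    """Cosine learning-rate schedule with linear warmup.

    Matches the setup described in the paper (10% warmup, single epoch);
    this particular implementation is an illustrative assumption.
    """
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        # Linear ramp from ~0 up to the peak learning rate.
        return peak_lr * (step + 1) / warmup_steps
    # Cosine decay from the peak down to 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))
```

For example, with 100 total steps the rate reaches its peak at step 9 (end of warmup), is halfway down at step 55, and decays to 0 at the final step.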