AlphaPO: Reward Shape Matters for LLM Alignment

Authors: Aman Gupta, Shao Tang, Qingquan Song, Sirou Zhu, Jiwoo Hong, Ankan Saha, Viral Gupta, Noah Lee, Eunki Kim, Siyu Zhu, Parag Agrawal, Natesh S. Pillai, Sathiya Keerthi

ICML 2025

Reproducibility Assessment
Variable | Result | LLM Response
Research Type | Experimental | AlphaPO outperforms SimPO and DPO on most benchmarks, as demonstrated in Table 1. Focusing on PairRM-based results, AlphaPO outperforms SimPO and DPO across both AE2 and AH for the instruct versions of the Llama 3 and Mistral models. The results are especially pronounced for AE2, where AlphaPO relatively improves over SimPO by 7% to 10% for LC, and over DPO by 15% for Llama 3 and 50% for Mistral.
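For clarity on how the percentages above are expressed: "relative improvement" is the gain over the baseline as a fraction of the baseline score. The sketch below illustrates the formula only; the scores used are hypothetical placeholders, not numbers from the paper.

```python
def relative_improvement(new_score: float, baseline_score: float) -> float:
    """Relative improvement of new_score over baseline_score.

    E.g., a baseline of 40.0 and a new score of 44.0 is a 10% relative
    improvement, even though the absolute gain is only 4 points.
    """
    return (new_score - baseline_score) / baseline_score

# Hypothetical example (not results from the paper):
gain = relative_improvement(44.0, 40.0)
print(f"{100 * gain:.1f}% relative improvement")
```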
Researcher Affiliation | Collaboration | Aman Gupta*1, Shao Tang*1, Qingquan Song1, Sirou Zhu1, Jiwoo Hong2, Ankan Saha1, Viral Gupta1, Noah Lee2, Eunki Kim2, Siyu Zhu1, Parag Agrawal1, Natesh Pillai1, S. Sathiya Keerthi1 (*Equal contribution; 1LinkedIn Corporation, CA, USA; 2KAIST AI, KAIST, South Korea)
Pseudocode | No | The paper describes methods and theoretical analysis using mathematical equations and proofs (e.g., Theorem 3.1, Corollary 3.3, Lemma A.1) but does not contain any clearly labeled 'Pseudocode' or 'Algorithm' blocks with structured, step-by-step procedures.
Open Source Code | No | The paper does not provide an explicit statement or link to source code for the AlphaPO methodology. It mentions the 'SimPO github' in Appendix A.6, but this refers to a baseline method, not the authors' own implementation.
Open Datasets | Yes | Datasets: We chose the UltraFeedback (UF) dataset (Cui et al., 2024) for all experiments. Previous works (Meng et al., 2024; Wu et al., 2024; Zhao et al., 2024) have demonstrated that using an on-policy setting for the instruct setup helps mitigate the distribution shift between off-the-shelf instruct variants of these models and the preference optimization process. Specifically, following Meng et al. (2024), we regenerate five responses for every prompt in the UF dataset using a sampling temperature of 0.8. We then use two reward models, PairRM (Jiang et al., 2023b) and ArmoRM (Wang et al., 2024b), to rank the five responses. The highest-scoring response is labeled yw and the lowest-scoring response is labeled yl. We use the PairRM-based dataset to conduct experiments for Llama 3 and Mistral, and leverage the ArmoRM-based dataset for the Llama 3- and Gemma 2-based experiments. Evaluation: We evaluate trained models using two popular benchmarks, AlpacaEval 2.0 (Dubois et al., 2024) and Arena-Hard (Wang et al., 2024b) (referred to as AE2 and AH, respectively, hereinafter). Additional experiments on other datasets: To understand the impact of AlphaPO on datasets other than those used traditionally for alignment, we compare AlphaPO- and SimPO-based checkpoints of the Mistral and Llama models on HellaSwag (Zellers et al., 2019) and TruthfulQA (Lin et al., 2021).
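The on-policy pair-construction procedure quoted above (sample five responses per prompt at temperature 0.8, score with a reward model, keep the best as yw and the worst as yl) can be sketched as follows. This is an illustrative sketch, not the authors' code: `generate` and `reward` are hypothetical stand-ins for the policy model and the PairRM/ArmoRM scorers.

```python
def build_preference_pair(prompt, generate, reward, n=5, temperature=0.8):
    """Return (y_w, y_l): the highest- and lowest-scoring of n sampled responses.

    generate(prompt, temperature) -> str : samples one response (stand-in for the LLM)
    reward(prompt, response) -> float    : scores a response (stand-in for PairRM/ArmoRM)
    """
    responses = [generate(prompt, temperature=temperature) for _ in range(n)]
    ranked = sorted(responses, key=lambda r: reward(prompt, r))
    return ranked[-1], ranked[0]  # (chosen y_w, rejected y_l)

# Toy usage with stub models (length as a dummy "reward"):
import random
random.seed(0)
gen = lambda p, temperature: p + " " + "x" * random.randint(1, 50)
rew = lambda p, r: float(len(r))
y_w, y_l = build_preference_pair("Explain preference optimization.", gen, rew)
```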
Dataset Splits | No | The paper mentions using a "20k sampled subset from the UltraFeedback dataset" (Appendix A.9) and creating preferred/dispreferred response pairs from this dataset, but it does not specify explicit training, validation, or test splits (e.g., percentages or counts for each set) for the overall dataset used in the experiments.
Hardware Specification | Yes | All the training experiments in this paper were conducted on 8 A100 GPUs with the adamw_torch optimizer, based on the alignment-handbook.
Software Dependencies | No | The paper mentions the 'adamw_torch optimizer' but does not provide specific version numbers for PyTorch or any other software libraries used, which are necessary for reproducible software dependencies.
Experiment Setup | Yes | Training hyperparameter tuning: Following the recommendations of SimPO, we adopt a global batch size of 128, a maximum sequence length of 2048, and a cosine learning-rate schedule with a warmup ratio of 0.1 for one epoch across all training settings. For DPO, we use the best β and learning-rate values reported in the SimPO github.

Table 3. Best hyperparameters for training.

Method   | Model                      | α    | β    | γ/β  | Learning Rate
DPO      | Mistral-Instruct           | -    | 0.01 | -    | 5e-7
DPO      | Llama-3-Instruct           | -    | 0.01 | -    | 7e-7
DPO      | Llama-3-Instruct (ArmoRM)  | -    | 0.01 | -    | 3e-7
DPO      | Gemma-2-Instruct           | -    | 0.01 | -    | 5e-7
SimPO    | Mistral-Instruct           | -    | 2.5  | 0.1  | 5e-7
SimPO    | Llama-3-Instruct           | -    | 2.5  | 0.55 | 1e-6
SimPO    | Llama-3-Instruct (ArmoRM)  | -    | 10.0 | 0.3  | 1e-6
SimPO    | Gemma-2-Instruct           | -    | 10   | 0.5  | 8e-7
AlphaPO  | Mistral-Instruct           | 0.25 | 2.5  | 0.1  | 7e-7
AlphaPO  | Llama-3-Instruct           | 0.25 | 2.5  | 1.0  | 1e-6
AlphaPO  | Llama-3-Instruct (ArmoRM)  | 0.25 | 10.0 | 0.3  | 1.1e-6
AlphaPO  | Gemma-2-Instruct           | 0.1  | 10   | 0.5  | 8e-7

Decoding hyperparameters: For AlpacaEval 2.0, we adopt its default settings, with weighted_alpaca_eval_gpt4_turbo as the annotator and gpt4_turbo as the reference model. We use a sampling decoding strategy to generate responses, with a temperature of 0.5 for the Mistral-Instruct setting and a temperature of 0.9 for the Llama-3-Instruct settings, following the SimPO configs. For the Gemma-2-Instruct setting, we use a temperature of 0.7, following the WPO-HB config, for better reproducibility. For Arena-Hard, we use the default greedy decoding for all settings and methods.
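For concreteness, the learning-rate schedule quoted above (linear warmup over the first 10% of steps, then cosine decay to zero from the peak rates in Table 3) can be sketched as below. This is an illustrative sketch of the standard warmup-plus-cosine schedule, not the authors' training code; the step counts are hypothetical.

```python
import math

def lr_at(step, total_steps, peak_lr, warmup_ratio=0.1):
    """Learning rate at a given optimizer step under a cosine schedule
    with linear warmup (warmup_ratio = 0.1 as in the training setup)."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear warmup from 0 to the peak learning rate.
        return peak_lr * step / max(1, warmup_steps)
    # Cosine decay from the peak learning rate down to 0.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))

# Hypothetical one-epoch run of 1000 steps at a peak LR of 1e-6
# (the AlphaPO Llama-3-Instruct setting in Table 3):
for s in (50, 100, 550, 1000):
    print(s, lr_at(s, 1000, 1e-6))
```

At step 100 (the end of warmup) the schedule reaches the full peak rate, and by the final step it has decayed to zero.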