Explicit Preference Optimization: No Need for an Implicit Reward Model
Authors: Xiangkun Hu, Lemin Kong, Tong He, David Wipf
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Section 5 provides empirical verification of conditions, in a controlled environment with known ground-truth preferences, whereby DPO-based regularization and related variants converge to degenerate minimizers while ℓEXPO minimizers do not. We then conclude with experiments involving real-world alignment data that show EXPO outperforms DPO-based models w.r.t. response win rates. |
| Researcher Affiliation | Collaboration | 1Amazon Web Services 2The Chinese University of Hong Kong. Correspondence to: Xiangkun Hu <EMAIL>, Lemin Kong <EMAIL>, Tong He <EMAIL>, David Wipf <EMAIL>. |
| Pseudocode | No | The paper describes methods and derivations mathematically but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code is available at https://github.com/lmkong020/explicit-preference-optimization. |
| Open Datasets | Yes | We then push beyond (Azar et al., 2024), which presents no real-world validation of IPO or other methods, and compare our EXPO framework using the Anthropic Helpfulness and Harmlessness (HH) real-world preference dataset (Bai et al., 2022a; Ganguli et al., 2022). [...] We also evaluate our models using the Llama-3-Base-8B (AI@Meta, 2024) on the widely-used open-ended instruction-following benchmark Alpaca Eval 2 (Li et al., 2023). |
| Dataset Splits | No | The paper mentions using 'test set' for evaluation on Anthropic HH and IMDb datasets and references Alpaca Eval 2 as a benchmark of questions, but does not provide specific percentages, sample counts, or explicit methodology for how these datasets are split into training, validation, and test sets. It relies on implicit splits within the benchmarks/datasets without detailing them. |
| Hardware Specification | Yes | All training was conducted using an 8 A100 40G GPU instance and the Adam optimizer (Kingma & Ba, 2014). |
| Software Dependencies | No | The paper mentions using the Adam optimizer and adapting an official DPO GitHub repository, but it does not specify version numbers for programming languages, libraries, or frameworks like Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | For the results in Figure 7, we train the base SFT model for 2 epochs and all the other models for 1 epoch, using a learning rate of 1 × 10⁻⁶ and a batch size of 40. We set λ = 0.1 for DPO and IPO. For EXPO (reg), we set λ = 0.2; we also found that increasing λ to 0.5 did not substantially alter EXPO performance. For EXPO (comp) we used λ = 0.05 since, again, its influence differs between the two variants. |
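The hyperparameters reported in the experiment-setup row can be collected into a small config sketch. The dict layout, key names, and the `reg_strength` helper below are purely illustrative assumptions; the paper does not specify a config format, only the values themselves.

```python
# Hyperparameters as quoted in the paper's experiment-setup excerpt.
# The dict structure and key names are hypothetical, not from the paper.
TRAIN_CONFIG = {
    "optimizer": "Adam",
    "learning_rate": 1e-6,          # "learning rate of 1 × 10⁻⁶"
    "batch_size": 40,
    "epochs": {"SFT": 2, "others": 1},
    # Regularization strength λ per method, as reported.
    "lambda": {"DPO": 0.1, "IPO": 0.1, "EXPO_reg": 0.2, "EXPO_comp": 0.05},
}

def reg_strength(method: str) -> float:
    """Look up the reported λ for a given method (hypothetical helper)."""
    return TRAIN_CONFIG["lambda"][method]

print(reg_strength("EXPO_reg"))   # 0.2
print(reg_strength("EXPO_comp"))  # 0.05
```

Note that λ plays a different role in the two EXPO variants, which is why EXPO (comp) uses a smaller value (0.05) than EXPO (reg) (0.2).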