Explicit Preference Optimization: No Need for an Implicit Reward Model
Authors: Xiangkun Hu, Lemin Kong, Tong He, David Wipf
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Section 5 provides empirical verification of conditions, in a controlled environment with known ground-truth preferences, whereby DPO-based regularization and related variants converge to degenerate minimizers while ℓEXPO minimizers do not. We then conclude with experiments involving real-world alignment data that show EXPO outperforms DPO-based models w.r.t. response win rates. |
| Researcher Affiliation | Collaboration | 1Amazon Web Services 2The Chinese University of Hong Kong. Correspondence to: Xiangkun Hu <EMAIL>, Lemin Kong <EMAIL>, Tong He <EMAIL>, David Wipf <EMAIL>. |
| Pseudocode | No | The paper describes methods and derivations mathematically but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code is available at https://github.com/lmkong020/explicit-preference-optimization. |
| Open Datasets | Yes | We then push beyond (Azar et al., 2024), which presents no real-world validation of IPO or other methods, and compare our EXPO framework using the Anthropic Helpfulness and Harmlessness (HH) real-world preference dataset (Bai et al., 2022a; Ganguli et al., 2022). [...] We also evaluate our models using the Llama-3-Base-8B (AI@Meta, 2024) on the widely-used open-ended instruction-following benchmark Alpaca Eval 2 (Li et al., 2023). |
| Dataset Splits | No | The paper mentions using 'test set' for evaluation on Anthropic HH and IMDb datasets and references Alpaca Eval 2 as a benchmark of questions, but does not provide specific percentages, sample counts, or explicit methodology for how these datasets are split into training, validation, and test sets. It relies on implicit splits within the benchmarks/datasets without detailing them. |
| Hardware Specification | Yes | All training was conducted using an 8 A100 40G GPU instance and the Adam optimizer (Kingma & Ba, 2014). |
| Software Dependencies | No | The paper mentions using the Adam optimizer and adapting an official DPO GitHub repository, but it does not specify version numbers for programming languages, libraries, or frameworks like Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | For the results in Figure 7, we train the base SFT model for 2 epochs and all the other models for 1 epoch, using a learning rate of 1 × 10⁻⁶ and a batch size of 40. We set λ = 0.1 for DPO and IPO. For EXPO (reg), we set λ = 0.2; we also found that increasing λ to 0.5 did not substantially alter EXPO performance. For EXPO (comp) we used λ = 0.05 since, again, its influence differs between the two variants. |
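The hyperparameters reported in the experiment-setup row can be collected into a small config sketch. The dict layout, key names, and the `reg_strength` helper below are purely illustrative assumptions; the paper does not specify a config format, only the values themselves.

```python
# Hyperparameters as quoted in the paper's experiment-setup excerpt.
# The dict structure and key names are hypothetical, not from the paper.
TRAIN_CONFIG = {
    "optimizer": "Adam",
    "learning_rate": 1e-6,          # "learning rate of 1 × 10⁻⁶"
    "batch_size": 40,
    "epochs": {"SFT": 2, "others": 1},
    # Regularization strength λ per method, as reported.
    "lambda": {"DPO": 0.1, "IPO": 0.1, "EXPO_reg": 0.2, "EXPO_comp": 0.05},
}

def reg_strength(method: str) -> float:
    """Look up the reported λ for a given method (hypothetical helper)."""
    return TRAIN_CONFIG["lambda"][method]

print(reg_strength("EXPO_reg"))   # 0.2
print(reg_strength("EXPO_comp"))  # 0.05
```

Note that λ plays a different role in the two EXPO variants, which is why EXPO (comp) uses a smaller value (0.05) than EXPO (reg) (0.2).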