Energy-Based Preference Model Offers Better Offline Alignment than the Bradley-Terry Preference Model

Authors: Yuzhong Hong, Hanshan Zhang, Junwei Bao, Hongfei Jiang, Yang Song

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirically, we demonstrate that EPA consistently delivers better performance on open benchmarks compared to DPO, thereby validating the theoretical superiority of our EBM. Table 1: The intrinsic evaluation based on the average Pearson coefficient (∈ [−1, 1]) and the average slope-1 linear regression error ε̂ shows that EPA renders a closer approximation to the slope-1 linearity than DPO. This is consistent with the extrinsic evaluation based on the Alpaca Eval 2.0 benchmark. (Section 5, Experiments)
Researcher Affiliation | Collaboration | Yuzhong Hong*1, Hanshan Zhang*2, Junwei Bao1, Hongfei Jiang1, Yang Song1. *Equal contribution. 1Zuoyebang Education Technology (Beijing) Co., Ltd. 2StepFun Technology Co., Ltd. Correspondence to: Yuzhong Hong <EMAIL>, Hanshan Zhang <EMAIL>, Junwei Bao <EMAIL>.
Pseudocode | No | The paper describes methods using mathematical formulations and textual descriptions but does not include any explicitly labeled pseudocode blocks or algorithms.
Open Source Code | No | The paper does not contain an explicit statement by the authors about releasing their source code for the described methodology, nor does it provide a direct link to a code repository for their work. It mentions a "Huggingface implementation" for evaluation and refers to https://github.com/argilla-io/distilabel in the references as a tool for creating datasets, but not their own code.
Open Datasets | Yes | We consider the dataset of Ultrafeedback (Cui et al., 2024) (denoted as UF-all) and a widely used pair-wise version of it (Tunstall et al., 2023) (UF-binarized). ... We also consider Alpaca-Eval 2.0 (Dubois et al., 2024) because of its high correlation with human preference... We report metrics on GSM8k (Cobbe et al., 2021), MMLU (Hendrycks et al., 2020), and Winograd (Tikhonov & Ryabinin, 2021).
Dataset Splits | Yes | MT-Bench (Zheng et al., 2024), which also uses GPT-4 to score a response on a scale of 1-10. The metric is the average score for 80 single-turn conversations and 80 multi-turn conversations. We also consider Alpaca-Eval 2.0 (Dubois et al., 2024) because of its high correlation with human preference, the ultimate concern for RLHF. Its metrics are win-rates (with or without length control) against GPT-4-turbo across 805 test samples... The test split of the Ultrafeedback data (Tunstall et al., 2023) can fulfill this purpose because there are four y for each x and they are scored using the same scoring scheme used for our training data (i.e., UF-all and UF-binarized).
Hardware Specification | Yes | We use 8 A100/A800 GPUs (80 GB memory) with ZeRO-3 parallelism to train each model in this paper.
Software Dependencies | No | We use mistral-7b-sft-beta as the reference model and for the initialization of policy in our paper. We train all models in this paper for 3 epochs with LoRA (r = 16, α = 16, dropout = 0.05). For evaluation on Alpaca Eval 2.0, we use the default decoding parameters in the Huggingface implementation.
Experiment Setup | Yes | We train all models in this paper for 3 epochs with LoRA (r = 16, α = 16, dropout = 0.05). For fair comparison of baseline models, we fix β to 0.01. Learning rate is grid-searched for each method among {1e-5, 5e-6, 1e-6}. Global batch size is fixed to 64.
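The intrinsic evaluation quoted above relies on two statistics: the Pearson coefficient (in [−1, 1]) and a slope-1 linear regression error ε̂. The paper does not give its exact formula for ε̂, so the sketch below is only an assumption: it fits y ≈ x + b with the slope fixed to 1 (the optimal intercept b is then the mean residual) and reports the RMS error of that fit.

```python
def pearson(x, y):
    # Pearson correlation coefficient; always lies in [-1, 1].
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def slope1_error(x, y):
    # Fit y ~= x + b with slope fixed at 1; the least-squares intercept
    # is the mean residual. Return the RMS error of that slope-1 fit.
    # NOTE: assumed definition of the paper's "epsilon-hat" -- not verified.
    n = len(x)
    b = sum(yi - xi for xi, yi in zip(x, y)) / n
    return (sum((yi - (xi + b)) ** 2 for xi, yi in zip(x, y)) / n) ** 0.5
```

A perfectly slope-1-linear relation (e.g. y = x + 1) gives Pearson 1.0 and ε̂ = 0, which is the ideal the table's "closer approximation to the slope-1 linearity" refers to.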
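The Dataset Splits row notes that the Ultrafeedback test split provides four scored responses y per prompt x. As an illustration (not the authors' code), such multi-response data can be turned into pairwise preference data by ordering every response pair by score and dropping ties:

```python
from itertools import combinations

def to_preference_pairs(responses):
    """responses: list of (text, score) for one prompt.
    Returns (chosen, rejected) pairs, one per strictly ordered score pair."""
    pairs = []
    for (ya, sa), (yb, sb) in combinations(responses, 2):
        if sa == sb:
            continue  # ties carry no preference signal
        chosen, rejected = (ya, yb) if sa > sb else (yb, ya)
        pairs.append((chosen, rejected))
    return pairs
```

With four responses per prompt this yields up to C(4, 2) = 6 pairs, which is one plausible way a "binarized" pairwise version can be derived from the scored data.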
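The reported setup can be collected into a single configuration sketch. Since the paper releases no code, the field names below are illustrative, not the authors' actual configuration keys:

```python
# Hyperparameters as reported in the Experiment Setup row; key names are assumed.
lora_config = {
    "r": 16,          # LoRA rank
    "alpha": 16,      # LoRA scaling alpha
    "dropout": 0.05,  # LoRA dropout
}

train_config = {
    "epochs": 3,                                # all models
    "beta": 0.01,                               # fixed for fair baseline comparison
    "learning_rate_grid": [1e-5, 5e-6, 1e-6],   # grid-searched per method
    "global_batch_size": 64,                    # fixed
    "parallelism": "ZeRO-3 on 8x A100/A800 (80 GB)",  # reported hardware
}
```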