Energy-Based Preference Model Offers Better Offline Alignment than the Bradley-Terry Preference Model

Authors: Yuzhong Hong, Hanshan Zhang, Junwei Bao, Hongfei Jiang, Yang Song

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirically, we demonstrate that EPA consistently delivers better performance on open benchmarks compared to DPO, thereby validating the theoretical superiority of our EBM. Table 1: The intrinsic evaluation based on the average Pearson coefficient (∈ [−1, 1]) and the average slope-1 linear regression error ε̂ shows that EPA renders a closer approximation to the slope-1 linearity than DPO. This is consistent with the extrinsic evaluation based on the Alpaca Eval 2.0 benchmark. (Section 5, Experiments)
Researcher Affiliation | Collaboration | Yuzhong Hong*1, Hanshan Zhang*2, Junwei Bao1, Hongfei Jiang1, Yang Song1. *Equal contribution. 1Zuoyebang Education Technology (Beijing) Co., Ltd. 2StepFun Technology Co., Ltd. Correspondence to: Yuzhong Hong <EMAIL>, Hanshan Zhang <EMAIL>, Junwei Bao <EMAIL>.
Pseudocode | No | The paper describes methods using mathematical formulations and textual descriptions but does not include any explicitly labeled pseudocode blocks or algorithms.
Open Source Code | No | The paper does not contain an explicit statement by the authors about releasing their source code for the described methodology, nor does it provide a direct link to a code repository for their work. It mentions a "Huggingface implementation" for evaluation and refers to https://github.com/argilla-io/distilabel in the references as a tool for creating datasets, but not their own code.
Open Datasets | Yes | We consider the dataset of Ultrafeedback (Cui et al., 2024) (denoted as UF-all) and a widely used pair-wise version of it (Tunstall et al., 2023) (UF-binarized). ... We also consider Alpaca-Eval 2.0 (Dubois et al., 2024) because of its high correlation with human preference... We report metrics on GSM8k (Cobbe et al., 2021), MMLU (Hendrycks et al., 2020), and Winograd (Tikhonov & Ryabinin, 2021).
Dataset Splits | Yes | MT-Bench (Zheng et al., 2024), which also uses GPT-4 to score a response on a scale of 1-10. The metric is the average score for 80 single-turn conversations and 80 multi-turn conversations. We also consider Alpaca-Eval 2.0 (Dubois et al., 2024) because of its high correlation with human preference, the ultimate concern for RLHF. Its metrics are win-rates (with or without length control) against GPT-4-turbo across 805 test samples... The test split of the Ultrafeedback data (Tunstall et al., 2023) can fulfill this purpose because there are four y for each x and they are scored using the same scoring scheme used for our training data (i.e., UF-all and UF-binarized).
Hardware Specification | Yes | We use 8 A100/A800 GPUs (80 GB memory) with ZeRO-3 parallelism to train each model in this paper.
Software Dependencies | No | We use mistral-7b-sft-beta as the reference model and for the initialization of policy in our paper. We train all models in this paper for 3 epochs with LoRA (r = 16, α = 16, dropout = 0.05). For evaluation on Alpaca Eval 2.0, we use the default decoding parameters in the Huggingface implementation.
Experiment Setup | Yes | We train all models in this paper for 3 epochs with LoRA (r = 16, α = 16, dropout = 0.05). For fair comparison of baseline models, we fix β to 0.01. Learning rate is grid-searched for each method among {1e-5, 5e-6, 1e-6}. Global batch size is fixed to 64.
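The intrinsic evaluation quoted above relies on two statistics: the Pearson coefficient (in [−1, 1]) and a slope-1 linear regression error ε̂. The paper does not give its exact formula for ε̂, so the sketch below is only an assumption: it fits y ≈ x + b with the slope fixed to 1 (the optimal intercept b is then the mean residual) and reports the RMS error of that fit.

```python
def pearson(x, y):
    # Pearson correlation coefficient; always lies in [-1, 1].
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def slope1_error(x, y):
    # Fit y ~= x + b with slope fixed at 1; the least-squares intercept
    # is the mean residual. Return the RMS error of that slope-1 fit.
    # NOTE: assumed definition of the paper's "epsilon-hat" -- not verified.
    n = len(x)
    b = sum(yi - xi for xi, yi in zip(x, y)) / n
    return (sum((yi - (xi + b)) ** 2 for xi, yi in zip(x, y)) / n) ** 0.5
```

A perfectly slope-1-linear relation (e.g. y = x + 1) gives Pearson 1.0 and ε̂ = 0, which is the ideal the table's "closer approximation to the slope-1 linearity" refers to.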
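The Dataset Splits row notes that the Ultrafeedback test split provides four scored responses y per prompt x. As an illustration (not the authors' code), such multi-response data can be turned into pairwise preference data by ordering every response pair by score and dropping ties:

```python
from itertools import combinations

def to_preference_pairs(responses):
    """responses: list of (text, score) for one prompt.
    Returns (chosen, rejected) pairs, one per strictly ordered score pair."""
    pairs = []
    for (ya, sa), (yb, sb) in combinations(responses, 2):
        if sa == sb:
            continue  # ties carry no preference signal
        chosen, rejected = (ya, yb) if sa > sb else (yb, ya)
        pairs.append((chosen, rejected))
    return pairs
```

With four responses per prompt this yields up to C(4, 2) = 6 pairs, which is one plausible way a "binarized" pairwise version can be derived from the scored data.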
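The reported setup can be collected into a single configuration sketch. Since the paper releases no code, the field names below are illustrative, not the authors' actual configuration keys:

```python
# Hyperparameters as reported in the Experiment Setup row; key names are assumed.
lora_config = {
    "r": 16,          # LoRA rank
    "alpha": 16,      # LoRA scaling alpha
    "dropout": 0.05,  # LoRA dropout
}

train_config = {
    "epochs": 3,                                # all models
    "beta": 0.01,                               # fixed for fair baseline comparison
    "learning_rate_grid": [1e-5, 5e-6, 1e-6],   # grid-searched per method
    "global_batch_size": 64,                    # fixed
    "parallelism": "ZeRO-3 on 8x A100/A800 (80 GB)",  # reported hardware
}
```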