Beyond Bradley-Terry Models: A General Preference Model for Language Model Alignment
Authors: Yifan Zhang, Ge Zhang, Yue Wu, Kangping Xu, Quanquan Gu
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results show that our General Preference embedding Model (GPM) consistently outperforms the BT reward model on the Reward Bench benchmark and effectively models cyclic preferences, where any BT reward model behaves like a random guess. Furthermore, evaluations on downstream tasks such as AlpacaEval 2.0, following language model post-training with GPO and our general preference model, reveal performance improvements over BT models. These findings indicate that our method may enhance the alignment of foundation models with nuanced human values. |
| Researcher Affiliation | Academia | IIIS, Tsinghua University, Beijing, China; Shanghai Qi Zhi Institute, Shanghai, China; Department of Computer Science, University of California, Los Angeles, California, USA. |
| Pseudocode | No | The paper describes algorithms like GPO and SPPO using mathematical formulations and descriptive text, for example in Section 5 "General Preference Optimization" and Equation (5.3). However, it does not contain any clearly labeled pseudocode blocks or algorithm listings with structured steps formatted like code. |
| Open Source Code | Yes | The code is available at https://github.com/general-preference/general-preference-model. |
| Open Datasets | Yes | We compare the GPM and BT reward model on the Reward Bench benchmark (Lambert et al., 2024), which covers diverse preference modeling tasks, including Chat, Chat Hard, Safety, and Reasoning. ... We trained both BT RMs and GPMs using the decontaminated version of Skywork Reward Data Collection (Liu & Zeng, 2024)... Specifically, we evaluate GPMs and BT RMs on Cyclic Preference datasets, which are constructed based on the Ultrafeedback dataset (Cui et al., 2024). |
| Dataset Splits | No | The paper mentions using "Skywork Reward Data Collection (Liu & Zeng, 2024)" for training and "Reward Bench (Lambert et al., 2024)" for evaluation, and also discusses a validation set used for hyperparameter tuning. For cyclic preferences, it states "we report the test accuracy on the training dataset but with different comparison pairs used in the training dataset". However, the paper does not provide specific percentages or exact counts for train/validation/test splits for any of these datasets, so the data partitioning cannot be precisely reproduced. |
| Hardware Specification | Yes | Hardware: All experiments were conducted on machines equipped with NVIDIA A800 80GB GPUs, utilizing 8 GPUs per experiment. ... For cyclic preference experiments... Hardware: Experiments were conducted on machines equipped with NVIDIA A800 80GB GPUs, utilizing a single GPU per experiment. |
| Software Dependencies | No | Our experiments on Reward Bench and Cyclic Preference Dataset were implemented using the Hugging Face Transformers library (Wolf et al., 2020) and the Open RLHF framework (Hu et al., 2024). The paper names these software components but does not provide specific version numbers for them. |
| Experiment Setup | Yes | Our experiments on Reward Bench and Cyclic Preference Dataset were implemented using the Hugging Face Transformers library (Wolf et al., 2020) and the Open RLHF framework (Hu et al., 2024). For reward model training on Skywork Reward Data Collection, we employed the following settings (in Table 8): Gemma-2B-it: Trained with a learning rate of 2e-6. Llama-3.1-8B-Instruct: Trained with a learning rate of 2e-6. Training Configuration: Both models were trained for two epochs with a global batch size of 32. We used a cosine learning rate scheduler with a warm-up ratio of 0.03. Input sequences were truncated to a maximum length of 2048 tokens. Hyperparameters: For our general preference embedding model (GPM), we set β = 0.1, determined via hyperparameter tuning on a validation set. ... For cyclic preference experiments, the training settings are as follows... Gemma-2B-it: Trained with a learning rate of 1e-6. Training Configuration: Models were trained for 50 epochs with a global batch size of 1. |
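The cyclic-preference claim in the Research Type row can be made concrete with a small sketch. This is a hypothetical 2-D illustration, not the paper's trained model: the GPM idea is that a skew-symmetric score on response embeddings can represent intransitive (cyclic) preferences, while any Bradley-Terry model, which derives preferences from scalar rewards, cannot fit all three legs of a cycle.

```python
import math

def bt_prob(r_x, r_y):
    """Bradley-Terry: P(x beats y) = sigmoid(r_x - r_y) from scalar rewards."""
    return 1.0 / (1.0 + math.exp(-(r_x - r_y)))

def gpm_score(theta_x, theta_y):
    """Skew-symmetric preference score v_x^T R v_y with R = [[0, 1], [-1, 0]]
    and unit embeddings v = (cos(theta), sin(theta)); equals sin(theta_y - theta_x).
    Positive means x is preferred to y; note score(x, y) = -score(y, x)."""
    vx = (math.cos(theta_x), math.sin(theta_x))
    vy = (math.cos(theta_y), math.sin(theta_y))
    return vx[0] * vy[1] - vx[1] * vy[0]

# Cyclic preference A > B, B > C, C > A: place embeddings 120 degrees apart.
tA, tB, tC = 0.0, 2 * math.pi / 3, 4 * math.pi / 3
cyclic_pairs = [(tA, tB), (tB, tC), (tC, tA)]
assert all(gpm_score(x, y) > 0 for x, y in cyclic_pairs)  # GPM fits all 3 legs

# BT cannot: (rA-rB) + (rB-rC) + (rC-rA) = 0, so the three reward gaps can
# never all be positive; at most 2 of the 3 pairs get P > 0.5, and any model
# forced to fit the full cycle does no better than chance on it.
rA, rB, rC = 1.0, 0.5, 0.0  # arbitrary scalar rewards
wins = [bt_prob(rA, rB) > 0.5, bt_prob(rB, rC) > 0.5, bt_prob(rC, rA) > 0.5]
print(sum(wins))
```

The angles, the 2-D embedding dimension, and the particular rotation operator here are illustrative assumptions; the paper's GPM learns higher-dimensional embeddings and the operator from data.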