Beyond Bradley-Terry Models: A General Preference Model for Language Model Alignment

Authors: Yifan Zhang, Ge Zhang, Yue Wu, Kangping Xu, Quanquan Gu

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results show that our General Preference embedding Model (GPM) consistently outperforms the BT reward model on the RewardBench benchmark and effectively models cyclic preferences, where any BT reward model behaves like a random guess. Furthermore, evaluations on downstream tasks such as AlpacaEval 2.0, following language model post-training with GPO and our general preference model, reveal performance improvements over BT models. These findings indicate that our method may enhance the alignment of foundation models with nuanced human values.
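The cyclic-preference claim above can be checked numerically. The sketch below is not the authors' code; the 2-D embeddings, the angles, and the skew-symmetric operator R are illustrative choices. It shows that a Bradley-Terry model's pairwise logits telescope to zero around any cycle A ≻ B ≻ C ≻ A, so the three win probabilities can never all exceed 0.5, whereas a preference-embedding score of the form v_iᵀ R v_j represents the cycle easily.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# --- Bradley-Terry: P(i beats j) = sigmoid(r_i - r_j) for scalar rewards r.
# Around any cycle the logits telescope to zero, so the three "win"
# probabilities cannot all exceed 0.5 at once.
r = {"A": 1.7, "B": 0.4, "C": -0.9}          # arbitrary scalar rewards
cycle = [("A", "B"), ("B", "C"), ("C", "A")]
bt_logit_sum = sum(r[i] - r[j] for i, j in cycle)
print("BT logit sum around cycle:", bt_logit_sum)   # always 0.0

# --- 2-D preference embeddings with a skew-symmetric operator.
# score(i, j) = v_i^T R v_j with R = [[0, -1], [1, 0]] (a 90-degree rotation);
# for unit vectors at angles theta this equals sin(theta_i - theta_j).
def score(vi, vj):
    return vi[0] * (-vj[1]) + vi[1] * vj[0]

angles = {"A": 240.0, "B": 120.0, "C": 0.0}   # equally spaced on the circle
v = {k: (math.cos(math.radians(a)), math.sin(math.radians(a)))
     for k, a in angles.items()}

probs = {(i, j): sigmoid(score(v[i], v[j])) for i, j in cycle}
print(probs)  # each probability is sigmoid(sin(120 deg)) > 0.5
```

With equally spaced embeddings, each pair in the cycle gets score sin(120°) ≈ 0.87, i.e. a win probability of about 0.70, which no scalar reward assignment can reproduce.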
Researcher Affiliation | Academia | (1) IIIS, Tsinghua University, Beijing, China; (2) Shanghai Qi Zhi Institute, Shanghai, China; (3) Department of Computer Science, University of California, Los Angeles, California, USA.
Pseudocode | No | The paper describes algorithms such as GPO and SPPO using mathematical formulations and descriptive text, for example in Section 5, "General Preference Optimization," and Equation (5.3). However, it does not contain any clearly labeled pseudocode blocks or algorithm listings with structured, code-like steps.
Open Source Code | Yes | The code is available at https://github.com/general-preference/generalpreference-model.
Open Datasets | Yes | We compare the GPM and BT reward model on the RewardBench benchmark (Lambert et al., 2024), which covers diverse preference modeling tasks, including Chat, Chat Hard, Safety, and Reasoning. ... We trained both BT RMs and GPMs using the decontaminated version of the Skywork Reward Data Collection (Liu & Zeng, 2024)... Specifically, we evaluate GPMs and BT RMs on cyclic preference datasets, which are constructed based on the UltraFeedback dataset (Cui et al., 2024).
Dataset Splits | No | The paper mentions using the "Skywork Reward Data Collection (Liu & Zeng, 2024)" for training and "RewardBench (Lambert et al., 2024)" for evaluation, and also discusses a validation set used for hyperparameter tuning. For cyclic preferences, it states "we report the test accuracy on the training dataset but with different comparison pairs used in the training dataset". However, the paper does not give explicit percentages or example counts for train/validation/test splits for any of these datasets, so the data partitioning cannot be precisely reproduced.
Hardware Specification | Yes | Hardware: All experiments were conducted on machines equipped with NVIDIA A800 80GB GPUs, utilizing 8 GPUs per experiment. ... For cyclic preference experiments... Hardware: Experiments were conducted on machines equipped with NVIDIA A800 80GB GPUs, utilizing a single GPU per experiment.
Software Dependencies | No | Our experiments on RewardBench and the Cyclic Preference dataset were implemented using the Hugging Face Transformers library (Wolf et al., 2020) and the OpenRLHF framework (Hu et al., 2024). The paper names these software components but does not provide specific version numbers for them.
Experiment Setup | Yes | Experiments on RewardBench and the Cyclic Preference dataset were implemented with the Hugging Face Transformers library (Wolf et al., 2020) and the OpenRLHF framework (Hu et al., 2024). For reward model training on the Skywork Reward Data Collection (Table 8): both Gemma-2B-it and Llama-3.1-8B-Instruct were trained with a learning rate of 2e-6, for two epochs, with a global batch size of 32, a cosine learning rate scheduler with a warm-up ratio of 0.03, and input sequences truncated to a maximum length of 2048 tokens. For the GPM, β = 0.1 was chosen via hyperparameter tuning on a validation set. ... For the cyclic preference experiments, Gemma-2B-it was trained with a learning rate of 1e-6, for 50 epochs, with a global batch size of 1.
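The reported scheduler settings (cosine decay with a 0.03 warm-up ratio) can be made concrete with a small sketch. This is not the authors' training code; it mirrors the common Transformers-style linear-warmup-then-cosine schedule, and the dataset size of 80k preference pairs is an assumption used only to derive step counts for illustration.

```python
import math

def cosine_lr(step, total_steps, base_lr, warmup_ratio=0.03):
    """Linear warm-up followed by cosine decay to zero (Transformers-style)."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# Settings from the report: lr = 2e-6, 2 epochs, global batch size 32.
# The dataset size (80,000 pairs) is an illustrative assumption.
num_pairs, batch_size, epochs, base_lr = 80_000, 32, 2, 2e-6
total_steps = (num_pairs // batch_size) * epochs   # 5000 optimizer steps
warmup_steps = int(total_steps * 0.03)             # 150 warm-up steps

lrs = [cosine_lr(s, total_steps, base_lr) for s in range(total_steps + 1)]
print(warmup_steps, max(lrs), lrs[-1])
```

The learning rate ramps linearly to 2e-6 over the first 3% of steps, then follows a half-cosine down to zero, matching the behavior of `transformers.get_cosine_schedule_with_warmup`.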