Beyond Bradley-Terry Models: A General Preference Model for Language Model Alignment

Authors: Yifan Zhang, Ge Zhang, Yue Wu, Kangping Xu, Quanquan Gu

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results show that our General Preference embedding Model (GPM) consistently outperforms the BT reward model on the RewardBench benchmark and effectively models cyclic preferences, where any BT reward model behaves like a random guess. Furthermore, evaluations on downstream tasks such as AlpacaEval 2.0, following language model post-training with GPO and our general preference model, reveal performance improvements over BT models. These findings indicate that our method may enhance the alignment of foundation models with nuanced human values.
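The cyclic-preference claim above can be checked numerically. The sketch below is not the authors' code; the 2-D embeddings, the angles, and the skew-symmetric operator R are illustrative choices. It shows that a Bradley-Terry model's pairwise logits telescope to zero around any cycle A ≻ B ≻ C ≻ A, so the three win probabilities can never all exceed 0.5, whereas a preference-embedding score of the form v_iᵀ R v_j represents the cycle easily.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# --- Bradley-Terry: P(i beats j) = sigmoid(r_i - r_j) for scalar rewards r.
# Around any cycle the logits telescope to zero, so the three "win"
# probabilities cannot all exceed 0.5 at once.
r = {"A": 1.7, "B": 0.4, "C": -0.9}          # arbitrary scalar rewards
cycle = [("A", "B"), ("B", "C"), ("C", "A")]
bt_logit_sum = sum(r[i] - r[j] for i, j in cycle)
print("BT logit sum around cycle:", bt_logit_sum)   # always 0.0

# --- 2-D preference embeddings with a skew-symmetric operator.
# score(i, j) = v_i^T R v_j with R = [[0, -1], [1, 0]] (a 90-degree rotation);
# for unit vectors at angles theta this equals sin(theta_i - theta_j).
def score(vi, vj):
    return vi[0] * (-vj[1]) + vi[1] * vj[0]

angles = {"A": 240.0, "B": 120.0, "C": 0.0}   # equally spaced on the circle
v = {k: (math.cos(math.radians(a)), math.sin(math.radians(a)))
     for k, a in angles.items()}

probs = {(i, j): sigmoid(score(v[i], v[j])) for i, j in cycle}
print(probs)  # each probability is sigmoid(sin(120 deg)) > 0.5
```

With equally spaced embeddings, each pair in the cycle gets score sin(120°) ≈ 0.87, i.e. a win probability of about 0.70, which no scalar reward assignment can reproduce.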
Researcher Affiliation | Academia | (1) IIIS, Tsinghua University, Beijing, China; (2) Shanghai Qi Zhi Institute, Shanghai, China; (3) Department of Computer Science, University of California, Los Angeles, California, USA.
Pseudocode | No | The paper describes algorithms such as GPO and SPPO using mathematical formulations and descriptive text, for example in Section 5, "General Preference Optimization," and Equation (5.3). However, it does not contain any clearly labeled pseudocode blocks or algorithm listings with structured, code-like steps.
Open Source Code | Yes | The code is available at https://github.com/general-preference/generalpreference-model.
Open Datasets | Yes | We compare the GPM and BT reward model on the RewardBench benchmark (Lambert et al., 2024), which covers diverse preference modeling tasks, including Chat, Chat Hard, Safety, and Reasoning. ... We trained both BT RMs and GPMs using the decontaminated version of the Skywork Reward Data Collection (Liu & Zeng, 2024)... Specifically, we evaluate GPMs and BT RMs on cyclic preference datasets, which are constructed based on the UltraFeedback dataset (Cui et al., 2024).
Dataset Splits | No | The paper mentions using the "Skywork Reward Data Collection (Liu & Zeng, 2024)" for training and "RewardBench (Lambert et al., 2024)" for evaluation, and also discusses a validation set used for hyperparameter tuning. For cyclic preferences, it states "we report the test accuracy on the training dataset but with different comparison pairs used in the training dataset". However, the paper does not give explicit percentages or example counts for train/validation/test splits for any of these datasets, so the data partitioning cannot be precisely reproduced.
Hardware Specification | Yes | Hardware: All experiments were conducted on machines equipped with NVIDIA A800 80GB GPUs, utilizing 8 GPUs per experiment. ... For cyclic preference experiments... Hardware: Experiments were conducted on machines equipped with NVIDIA A800 80GB GPUs, utilizing a single GPU per experiment.
Software Dependencies | No | Our experiments on RewardBench and the Cyclic Preference dataset were implemented using the Hugging Face Transformers library (Wolf et al., 2020) and the OpenRLHF framework (Hu et al., 2024). The paper names these software components but does not provide specific version numbers for them.
Experiment Setup | Yes | Experiments on RewardBench and the Cyclic Preference dataset were implemented with the Hugging Face Transformers library (Wolf et al., 2020) and the OpenRLHF framework (Hu et al., 2024). For reward model training on the Skywork Reward Data Collection (Table 8): both Gemma-2B-it and Llama-3.1-8B-Instruct were trained with a learning rate of 2e-6, for two epochs, with a global batch size of 32, a cosine learning rate scheduler with a warm-up ratio of 0.03, and input sequences truncated to a maximum length of 2048 tokens. For the GPM, β = 0.1 was chosen via hyperparameter tuning on a validation set. ... For the cyclic preference experiments, Gemma-2B-it was trained with a learning rate of 1e-6, for 50 epochs, with a global batch size of 1.
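The reported scheduler settings (cosine decay with a 0.03 warm-up ratio) can be made concrete with a small sketch. This is not the authors' training code; it mirrors the common Transformers-style linear-warmup-then-cosine schedule, and the dataset size of 80k preference pairs is an assumption used only to derive step counts for illustration.

```python
import math

def cosine_lr(step, total_steps, base_lr, warmup_ratio=0.03):
    """Linear warm-up followed by cosine decay to zero (Transformers-style)."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# Settings from the report: lr = 2e-6, 2 epochs, global batch size 32.
# The dataset size (80,000 pairs) is an illustrative assumption.
num_pairs, batch_size, epochs, base_lr = 80_000, 32, 2, 2e-6
total_steps = (num_pairs // batch_size) * epochs   # 5000 optimizer steps
warmup_steps = int(total_steps * 0.03)             # 150 warm-up steps

lrs = [cosine_lr(s, total_steps, base_lr) for s in range(total_steps + 1)]
print(warmup_steps, max(lrs), lrs[-1])
```

The learning rate ramps linearly to 2e-6 over the first 3% of steps, then follows a half-cosine down to zero, matching the behavior of `transformers.get_cosine_schedule_with_warmup`.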