HelpSteer2-Preference: Complementing Ratings with Preferences

Authors: Zhilin Wang, Alexander Bukharin, Olivier Delalleau, Daniel Egert, Gerald Shen, Jiaqi Zeng, Oleksii Kuchaiev, Yi Dong

ICLR 2025

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "Using this data, we conduct the first head-to-head comparison of Bradley-Terry and Regression models when adequately matched for data. We perform evaluation using Reward Bench (Lambert et al., 2024), a trusted reward modeling benchmark with over 140 models on the public leaderboard." (Table 1: Performance of Models on Reward Bench) |
| Researcher Affiliation | Collaboration | Zhilin Wang^1, Alexander Bukharin^1,2, Olivier Delalleau^1, Daniel Egert^1, Gerald Shen^1, Jiaqi Zeng^1, Oleksii Kuchaiev^1, Yi Dong^1 (^1 NVIDIA; ^2 Georgia Tech, work done during internship at NVIDIA) |
| Pseudocode | No | The paper describes its methods through mathematical loss-function equations and textual explanations, but provides no explicit pseudocode or algorithm blocks. |
| Open Source Code | Yes | Reward model: huggingface.co/nvidia/Llama-3.1-Nemotron-70B-Reward; instruct model: huggingface.co/nvidia/Llama-3.1-Nemotron-70B-Instruct |
| Open Datasets | Yes | Dataset (CC-BY-4.0 license): huggingface.co/datasets/nvidia/HelpSteer2 |
| Dataset Splits | Yes | "Overall, we have 7,118 preference pairs with 6,766 pairs in the training set and 352 pairs in the validation set." |
| Hardware Specification | Yes | "Experiments are run on nodes of 8 A100/H100-80GB SXM GPUs on internal clusters." |
| Software Dependencies | No | The paper mentions NLTK for sentence tokenization, Scikit-Learn for kappa-score calculation, and GPT-4-Turbo for evaluation, but gives no version numbers for these libraries or for the framework used to implement the models. |
| Experiment Setup | Yes | Appendix E (Training Hyper-parameters) details epochs, global batch sizes, learning rates, the optimizer (AdamW), warm-up steps, and KL penalties for Reward Modelling, Direct Preference Optimization, Proximal Policy Optimization, and REINFORCE. |
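The head-to-head comparison in the Research Type row contrasts two reward-model objectives. A minimal sketch of the two loss families, with illustrative scalar rewards rather than the paper's actual model outputs:

```python
import math

def bradley_terry_loss(r_chosen: float, r_rejected: float) -> float:
    """Pairwise Bradley-Terry loss: -log sigmoid(r_chosen - r_rejected).
    Trains the reward model to score the preferred response higher."""
    margin = r_chosen - r_rejected
    # Numerically stable form of -log(sigmoid(margin))
    return math.log1p(math.exp(-margin))

def regression_loss(r_pred: float, r_label: float) -> float:
    """Regression-style objective: fit a scalar rating (e.g. a
    helpfulness score) directly with squared error."""
    return (r_pred - r_label) ** 2
```

The Bradley-Terry loss only constrains the *gap* between responses, while the regression loss anchors each response to an absolute rating; the paper's comparison hinges on matching the data available to both.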
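The Software Dependencies row notes that the paper computes kappa scores with Scikit-Learn. As a self-contained illustration of the metric (not the paper's code), Cohen's kappa for two annotators can be computed as:

```python
from collections import Counter

def cohen_kappa(ann_a: list, ann_b: list) -> float:
    """Cohen's kappa: agreement between two annotators, corrected
    for the agreement expected by chance from label frequencies."""
    assert len(ann_a) == len(ann_b) and ann_a
    n = len(ann_a)
    # Observed agreement: fraction of items labelled identically
    observed = sum(a == b for a, b in zip(ann_a, ann_b)) / n
    # Chance agreement from each annotator's marginal label frequencies
    freq_a, freq_b = Counter(ann_a), Counter(ann_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)
```

Scikit-Learn's `sklearn.metrics.cohen_kappa_score` computes the same quantity; the pure-Python version above just makes the chance-correction step explicit.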
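The Experiment Setup row cites KL penalties for Direct Preference Optimization among the tuned hyper-parameters. A minimal sketch of the per-pair DPO objective, assuming sequence log-probabilities from a policy and a frozen reference model (values here are illustrative, not the paper's):

```python
import math

def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss: -log sigmoid(beta * (implicit reward margin)),
    where each implicit reward is the policy/reference log-ratio.
    beta controls the strength of the implicit KL penalty."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # Numerically stable form of -log(sigmoid(margin))
    return math.log1p(math.exp(-margin))
```

When the policy matches the reference the margin is zero and the loss is log 2; pushing probability toward chosen responses and away from rejected ones lowers it.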