HelpSteer2-Preference: Complementing Ratings with Preferences
Authors: Zhilin Wang, Alexander Bukharin, Olivier Delalleau, Daniel Egert, Gerald Shen, Jiaqi Zeng, Oleksii Kuchaiev, Yi Dong
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Using this data, we conduct the first head-to-head comparison of Bradley-Terry and Regression models when adequately matched for data. We perform evaluation using Reward Bench (Lambert et al., 2024), a trusted reward modeling benchmark with over 140 models on the public leaderboard. Table 1: Performance of Models on Reward Bench. |
| Researcher Affiliation | Collaboration | Zhilin Wang (1), Alexander Bukharin (1,2), Olivier Delalleau (1), Daniel Egert (1), Gerald Shen (1), Jiaqi Zeng (1), Oleksii Kuchaiev (1), Yi Dong (1). (1) NVIDIA; (2) Georgia Tech, work done during internship at NVIDIA |
| Pseudocode | No | The paper describes methods using mathematical equations for loss functions and textual explanations, but no explicit pseudocode or algorithm blocks are provided. |
| Open Source Code | Yes | Reward Model: huggingface.co/nvidia/Llama-3.1-Nemotron-70B-Reward Instruct Model: huggingface.co/nvidia/Llama-3.1-Nemotron-70B-Instruct |
| Open Datasets | Yes | Dataset (CC-BY-4.0 license): huggingface.co/datasets/nvidia/HelpSteer2 |
| Dataset Splits | Yes | Overall, we have 7,118 preference pairs with 6,766 pairs in the training set and 352 pairs in the validation set. |
| Hardware Specification | Yes | Experiments are run on nodes of 8 A100/H100-80GB SXM GPUs on internal clusters. |
| Software Dependencies | No | The paper mentions using NLTK for sentence tokenization, Scikit-Learn for kappa score calculation, and GPT-4-Turbo for evaluation, but it does not provide version numbers for these libraries or name the framework used for model implementation. |
| Experiment Setup | Yes | Appendix E: TRAINING HYPER-PARAMETERS provides details on epochs, global batch sizes, learning rates, optimizers (AdamW), warm-up steps, and KL penalties for Reward Modelling, Direct Preference Optimization, Proximal Policy Optimization, and REINFORCE. |
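The paper's central comparison is between Bradley-Terry and regression-style reward models. As background for that comparison, the following is a minimal sketch of the standard Bradley-Terry pairwise loss — not the paper's exact implementation — where the model assigns a scalar reward to each response and the loss penalizes pairs in which the rejected response outscores the chosen one:

```python
import math

def bradley_terry_loss(chosen_rewards, rejected_rewards):
    """Bradley-Terry pairwise loss: -log sigmoid(r_chosen - r_rejected),
    averaged over preference pairs. Lower is better; the loss shrinks as
    the chosen response's reward exceeds the rejected one's."""
    losses = []
    for r_chosen, r_rejected in zip(chosen_rewards, rejected_rewards):
        d = r_chosen - r_rejected
        # -log sigmoid(d) = log(1 + exp(-d)); log1p keeps this stable for small d
        losses.append(math.log1p(math.exp(-d)))
    return sum(losses) / len(losses)

# Toy example: scalar rewards for three preference pairs (illustrative values)
loss = bradley_terry_loss([1.2, 0.5, 2.0], [0.3, 0.7, 1.1])
```

In practice the scalar rewards would come from a reward-model head on top of an LLM; the second pair above (0.5 vs. 0.7) shows how a misordered pair contributes a larger loss term than correctly ordered ones.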