Reward Modeling with Ordinal Feedback: Wisdom of the Crowd

Authors: Shang Liu, Yu Pan, Guanting Chen, Xiaocheng Li

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our numerical experiments validate that: (1) fine-grained feedback leads to better RM learning for both in- and out-of-distribution settings; (2) incorporating a certain proportion of tied samples boosts RM learning. In Section 5, we conduct two numerical experiments. The first experiment sets up four different ordinal feedback systems (oracle, 5-level, 3-level, and binary) and validates the theoretical findings that fine-grained ordinal feedback achieves higher accuracies in both in- and out-of-distribution settings. The second experiment mixes the training data with different proportions of tied and untied samples.
Researcher Affiliation | Academia | (1) Imperial College Business School, Imperial College London, UK; (2) The University of Sydney Business School, The University of Sydney, Australia; (3) Department of Statistics and Operations Research, UNC at Chapel Hill, USA. Correspondence to: Guanting Chen <EMAIL>, Xiaocheng Li <EMAIL>.
Pseudocode | Yes | Algorithm 1 (3-level Sampling Algorithm). Input: oracle label z_oracle ∈ [0, 1]. Output: sampled label z ∈ Z3 = {0, 0.5, 1}.
Open Source Code | No | No explicit statement or link providing access to the authors' source code for the described methodology was found in the paper.
Open Datasets | Yes | In the following numerical experiments, we leverage the Skywork-Reward-Preference-80K-v0.2 dataset (Liu et al., 2024a) as our base training dataset... In addition, we use the RewardBench dataset (Lambert et al., 2024) for the out-of-distribution evaluation task... We consider two commonly used preference datasets with fine-grained preference scores, UltraFeedback (Cui et al., 2023) and HelpSteer2 (Wang et al., 2024c).
Dataset Splits | Yes | For each run, we randomly sample a 1024-sized subset as the hold-out evaluation dataset... we limit the training samples to 32,768 and consider 5 different proportions of the tied data.
Hardware Specification | No | No specific hardware details (e.g., exact GPU/CPU models, memory amounts, or detailed computer specifications) used for running the experiments are provided in the paper.
Software Dependencies | No | Our base models for the following experiments include llama-3.2-1b-instruct (Llama Team, 2024), gemma-2-2b-it (Gemma Team, 2024), and qwen2.5-1.5b-instruct (Yang et al., 2024)... Optimizer: paged adamw 32bit. The paper does not specify version numbers for these, nor for any other general software libraries or programming languages used.
Experiment Setup | Yes | Table 4 (Hyperparameter Search Space), Table 5 (Shared Hyperparameters), and Table 6 (Model-specified Hyperparameters) detail parameters such as Learning Rate, Batch Size, Warm-up Ratio, Optimizer (paged adamw 32bit), Weight Decay (1e-3), Epochs (2), and Scheduler (Linear Warm-up + Cosine Decay).
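The Pseudocode row above only quotes the interface of the paper's Algorithm 1 (oracle label z_oracle ∈ [0, 1] in, 3-level label z ∈ {0, 0.5, 1} out), not its body. A minimal sketch of one way such a sampler could work is below; the specific randomized rounding scheme (chosen here so that E[z] = z_oracle, i.e., the coarse label is unbiased for the oracle soft label) is an assumption for illustration, not necessarily the paper's exact procedure.

```python
import random

def sample_3level(z_oracle: float) -> float:
    """Sample a 3-level ordinal label z in {0, 0.5, 1} from an oracle
    soft label z_oracle in [0, 1].

    Assumed scheme (not necessarily the paper's Algorithm 1): randomized
    rounding toward the tie label 0.5, calibrated so E[z] = z_oracle.
    """
    if not 0.0 <= z_oracle <= 1.0:
        raise ValueError("z_oracle must lie in [0, 1]")
    if z_oracle >= 0.5:
        # z = 1 with probability 2*z_oracle - 1, otherwise z = 0.5 (tie);
        # then E[z] = (2*z_oracle - 1) + 0.5*(2 - 2*z_oracle) = z_oracle.
        return 1.0 if random.random() < 2.0 * z_oracle - 1.0 else 0.5
    # z = 0 with probability 1 - 2*z_oracle, otherwise z = 0.5 (tie);
    # then E[z] = 0.5 * (2*z_oracle) = z_oracle.
    return 0.0 if random.random() < 1.0 - 2.0 * z_oracle else 0.5
```

Under this scheme, an oracle label near 0.5 is mapped to the tie label most of the time, which is consistent with the paper's second experiment deliberately mixing tied and untied samples into the training data.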