Reward Modeling with Ordinal Feedback: Wisdom of the Crowd

Authors: Shang Liu, Yu Pan, Guanting Chen, Xiaocheng Li

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our numerical experiments validate that: (1) fine-grained feedback leads to better RM learning for both in- and out-of-distribution settings; (2) incorporating a certain proportion of tied samples boosts RM learning. In Section 5, we conduct two numerical experiments. The first experiment sets up four different ordinal feedback systems (oracle, 5-level, 3-level, and binary) and validates the theoretical findings that fine-grained ordinal feedback achieves higher accuracies in both in- and out-of-distribution settings. The second experiment mixes the training data with different proportions of tied and untied samples.
Researcher Affiliation | Academia | (1) Imperial College Business School, Imperial College London, UK; (2) The University of Sydney Business School, The University of Sydney, Australia; (3) Department of Statistics and Operations Research, UNC at Chapel Hill, USA. Correspondence to: Guanting Chen <EMAIL>, Xiaocheng Li <EMAIL>.
Pseudocode | Yes | Algorithm 1 (3-level Sampling Algorithm). Input: oracle label z_oracle ∈ [0, 1]. Output: sampled label z ∈ Z3 = {0, 0.5, 1}.
Open Source Code | No | No explicit statement or link providing access to the authors' source code for the described methodology was found in the paper.
Open Datasets | Yes | In the following numerical experiments, we leverage the Skywork-Reward-Preference-80K-v0.2 dataset (Liu et al., 2024a) as our base training dataset... In addition, we use the RewardBench dataset (Lambert et al., 2024) for the out-of-distribution evaluation task... We consider two commonly used preference datasets with fine-grained preference scores, UltraFeedback (Cui et al., 2023) and HelpSteer2 (Wang et al., 2024c).
Dataset Splits | Yes | For each run, we randomly sample a 1024-sized subset as the hold-out evaluation dataset... we limit the training samples to 32,768 and consider 5 different proportions of the tied data.
Hardware Specification | No | No specific hardware details (e.g., exact GPU/CPU models, memory amounts, or detailed computer specifications) used for running the experiments are provided in the paper.
Software Dependencies | No | Our base models for the following experiments include llama-3.2-1b-instruct (Llama Team, 2024), gemma-2-2b-it (Gemma Team, 2024), and qwen2.5-1.5b-instruct (Yang et al., 2024)... Optimizer: paged adamw 32bit. The paper does not specify version numbers for these, nor for any other general software libraries or programming languages used.
Experiment Setup | Yes | Table 4 (Hyperparameter Search Space), Table 5 (Shared Hyperparameters), and Table 6 (Model-specified Hyperparameters) detail parameters such as Learning Rate, Batch Size, Warm-up Ratio, Optimizer (paged adamw 32bit), Weight Decay (1e-3), Epochs (2), and Scheduler (Linear Warm-up + Cosine Decay).
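The Pseudocode row above only quotes the interface of the paper's Algorithm 1 (oracle label z_oracle ∈ [0, 1] in, 3-level label z ∈ {0, 0.5, 1} out), not its body. A minimal sketch of one way such a sampler could work is below; the specific randomized rounding scheme (chosen here so that E[z] = z_oracle, i.e., the coarse label is unbiased for the oracle soft label) is an assumption for illustration, not necessarily the paper's exact procedure.

```python
import random

def sample_3level(z_oracle: float) -> float:
    """Sample a 3-level ordinal label z in {0, 0.5, 1} from an oracle
    soft label z_oracle in [0, 1].

    Assumed scheme (not necessarily the paper's Algorithm 1): randomized
    rounding toward the tie label 0.5, calibrated so E[z] = z_oracle.
    """
    if not 0.0 <= z_oracle <= 1.0:
        raise ValueError("z_oracle must lie in [0, 1]")
    if z_oracle >= 0.5:
        # z = 1 with probability 2*z_oracle - 1, otherwise z = 0.5 (tie);
        # then E[z] = (2*z_oracle - 1) + 0.5*(2 - 2*z_oracle) = z_oracle.
        return 1.0 if random.random() < 2.0 * z_oracle - 1.0 else 0.5
    # z = 0 with probability 1 - 2*z_oracle, otherwise z = 0.5 (tie);
    # then E[z] = 0.5 * (2*z_oracle) = z_oracle.
    return 0.0 if random.random() < 1.0 - 2.0 * z_oracle else 0.5
```

Under this scheme, an oracle label near 0.5 is mapped to the tie label most of the time, which is consistent with the paper's second experiment deliberately mixing tied and untied samples into the training data.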