Diverging Preferences: When do Annotators Disagree and do Models Know?
Authors: Michael J.Q. Zhang, Zhilin Wang, Jena D. Hwang, Yi Dong, Olivier Delalleau, Yejin Choi, Eunsol Choi, Xiang Ren, Valentina Pyatkin
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We develop a taxonomy of disagreement sources spanning ten categories across four high-level classes and find that the majority of disagreements are due to factors such as task underspecification or response style. Our findings challenge a standard assumption in reward modeling methods that annotator disagreements can be attributed to simple noise. We then explore how these findings impact two areas of LLM development: reward modeling training and evaluation. In our experiments, we demonstrate how standard reward modeling (e.g., Bradley-Terry) and LLM-as-Judge evaluation methods fail to account for divergence between annotators. We train separate reward models for each dataset based on Llama-3-8B-Instruct (Dubey et al., 2024b), and evaluate on 500 held-out examples from each dataset. |
| Researcher Affiliation | Collaboration | 1New York University 2NVIDIA 3Allen Institute for Artificial Intelligence 4University of Southern California 5University of Washington. |
| Pseudocode | No | The paper describes methods in prose, including mathematical formulations and diagrams (e.g., Figure 3), but does not contain any sections explicitly labeled 'Pseudocode' or 'Algorithm', nor structured, step-by-step procedures formatted like code. |
| Open Source Code | No | The paper does not contain an explicit statement about releasing the source code for the methodology described, nor does it provide a link to a code repository. |
| Open Datasets | Yes | To make this research possible, we introduce MultiPref-Disagreements and HelpSteer2-Disagreements.1 With these datasets, we also include a novel taxonomy of disagreement sources... Footnote 1: Note that we did not collect new datasets but instead are releasing the individual annotations of these existing datasets... MultiPref, a multi-annotated and multi-aspect human preference dataset. https://huggingface.co/datasets/allenai/multipref, 2024a. ... HelpSteer2 is a dataset of 12K preference pairs... The original 10k samples at https://huggingface.co/datasets/nvidia/HelpSteer2 excludes samples with high disagreement... We look at all examples of diverging preferences from MultiPref on prompts sourced from the Anthropic Harmless dataset (Bai et al., 2022a). ... To accomplish this, we take the underspecified prompts category from CocoNot (Brahman et al., 2024)... We use our trained distributional reward models to identify such instances in the WildBench benchmark, an LLM-as-Judge benchmark that sources prompts from real user-LLM interactions (Lin et al., 2024). |
| Dataset Splits | Yes | We train separate reward models for each dataset based on Llama-3-8B-Instruct (Dubey et al., 2024b), and evaluate on 500 held-out examples from each dataset. |
| Hardware Specification | Yes | All systems were trained on 8 RTX A6000 GPUs. |
| Software Dependencies | No | For training, we experiment using the PyTorch (Paszke et al., 2019) approximation of the normal distribution CDF Φ(x)... The paper mentions PyTorch but does not provide a specific version number, nor does it list versions for other key software components like Python or CUDA. |
| Experiment Setup | Yes | We train all reward models with a learning rate of 5e-5 and a batch size of 16 for a maximum of 10 epochs, selecting the best-performing checkpoint evaluated after every 0.25 epochs. |
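The Research Type row quotes the paper's claim that standard Bradley-Terry reward modeling cannot represent annotator divergence. As a minimal, self-contained sketch (plain Python, not the paper's actual training code), the Bradley-Terry objective reduces a preference pair to a single win probability, so when annotators split on a pair the model is simply pulled toward a zero reward margin rather than representing the disagreement:

```python
import math


def bradley_terry_nll(r_chosen: float, r_rejected: float) -> float:
    """Negative log-likelihood of one preference under the Bradley-Terry
    model: P(chosen > rejected) = sigmoid(r_chosen - r_rejected).

    Written as -log(sigmoid(margin)) in a numerically stable form.
    """
    margin = r_chosen - r_rejected
    return math.log1p(math.exp(-margin))


# Agreeing annotators: a large positive margin gives a small loss.
confident = bradley_terry_nll(2.0, 0.0)

# Diverging annotators: the same pair appears with both labels, so the
# summed loss is minimized at margin 0, i.e. P(chosen) = 0.5 -- the
# disagreement itself is not modeled, only averaged away.
split = bradley_terry_nll(0.5, 0.0) + bradley_terry_nll(0.0, 0.5)
```

The function name and scalar-reward framing here are illustrative assumptions for exposition; the paper trains Llama-3-8B-Instruct-based reward models rather than standalone scalars.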
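The Software Dependencies row quotes the paper's use of a normal-CDF approximation Φ(x) for its distributional reward models. A hedged sketch of the underlying idea, in plain Python: if each response's reward is modeled as a Gaussian (mean plus a variance that can absorb annotator disagreement), the preference probability follows from Φ applied to the standardized mean difference. The function names and the independent-Gaussian comparison are assumptions for illustration, not the paper's exact parameterization:

```python
import math


def phi(x: float) -> float:
    """Standard normal CDF, computed exactly via the error function
    (PyTorch instead uses a fast approximation of this function)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def p_prefer(mu_a: float, var_a: float, mu_b: float, var_b: float) -> float:
    """P(r_A > r_B) when rewards are independent Gaussians:
    r_A - r_B ~ N(mu_a - mu_b, var_a + var_b).

    High variance (diverging annotators) pushes this toward 0.5 even
    when the means differ, which is how a distributional reward model
    can flag likely-divergent examples (e.g., in WildBench).
    """
    return phi((mu_a - mu_b) / math.sqrt(var_a + var_b))
```

Usage-wise, `p_prefer(1.0, 1.0, 0.0, 1.0)` is well above 0.5, while the same mean gap with much larger variances drifts back toward 0.5, separating confident preferences from divergent ones.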