A density estimation perspective on learning from pairwise human preferences
Authors: Vincent Dumoulin, Daniel D. Johnson, Pablo Samuel Castro, Hugo Larochelle, Yann Dauphin
TMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We provide theoretical and empirical results showing that for a family of generative processes defined via preference behavior distribution equations, training a reward function on pairwise preferences effectively models an annotator's implicit preference distribution. Finally, we discuss and present findings on annotator misspecification failure cases where wrong modeling assumptions are made about annotator behavior, resulting in poorly-adapted models, suggesting that approaches that learn from pairwise human preferences could have trouble learning from a population of annotators with diverse viewpoints. A notebook reproducing all experiments in this paper can be accessed at https://github.com/google-deepmind/pbde. |
| Researcher Affiliation | Collaboration | Google DeepMind; University of Toronto; Mila, Université de Montréal; CIFAR Fellow |
| Pseudocode | No | The paper describes algorithms and procedures using mathematical notation and prose, but it does not contain any clearly labeled pseudocode blocks or algorithms. |
| Open Source Code | Yes | A notebook reproducing all experiments in this paper can be accessed at https://github.com/google-deepmind/pbde. |
| Open Datasets | Yes | We now move to the One Billion Words Benchmark (LM1B; Chelba et al., 2013) to illustrate the consequences of annotator misspecification in a language modeling setting. |
| Dataset Splits | No | The paper describes data generation and sampling strategies for its experiments (e.g., 'drawing 2^15 observation pairs x A and x B uniformly at random'), and training on 'distinct subsets of LM1B according to sequence length'. However, it does not specify explicit training, validation, or test dataset splits in terms of percentages or sample counts for model evaluation in a standard machine learning context. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or other computing specifications used for running the experiments. |
| Software Dependencies | No | The paper mentions 'Flax's LM1B example code' and 'Optimizer Adam (Kingma & Ba, 2015)' but does not specify version numbers for any software libraries, frameworks, or programming languages used in the experiments. |
| Experiment Setup | Yes | Table 1: Univariate toy experiment hyperparameters. Architecture: MLP; Hidden layers: 4 layers of width 64; Activation: tanh; Optimizer: Adam (Kingma & Ba, 2015); Optimization steps: 8192; Learning rate: 5e-4; Learning rate schedule: cosine decay to 0.0 over 8192 steps |
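The "Research Type" evidence describes training a reward function on pairwise preferences so that it models the annotator's implicit preference distribution. As a hypothetical illustration only (this is the standard Bradley-Terry pairwise formulation commonly used in preference learning, not code from the paper's notebook; the function name and signature are assumptions):

```python
import math

def pairwise_preference_loss(r_a, r_b, a_preferred=True):
    """Negative log-likelihood of an annotator's choice between two
    observations x_A and x_B under the Bradley-Terry model,
    p(A preferred over B) = sigmoid(r_A - r_B),
    where r_a and r_b are scalar reward-model outputs."""
    p_a = 1.0 / (1.0 + math.exp(-(r_a - r_b)))
    return -math.log(p_a if a_preferred else 1.0 - p_a)

# Equal rewards imply a 50/50 preference, so the loss is ln 2.
print(round(pairwise_preference_loss(0.0, 0.0), 4))  # → 0.6931
```

Minimizing this loss over sampled preference pairs pushes the reward difference toward the annotator's log-odds of preferring one observation over the other, which is the sense in which the learned reward captures the preference distribution.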
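The toy-experiment hyperparameters above specify a learning rate of 5e-4 with cosine decay to 0.0 over 8192 steps. A minimal sketch of such a schedule, assuming the standard cosine-decay formula (the paper's notebook may implement it differently, e.g. via an optimizer library):

```python
import math

def cosine_decay_lr(step, base_lr=5e-4, total_steps=8192):
    """Cosine decay from base_lr at step 0 to 0.0 at total_steps."""
    step = min(step, total_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * step / total_steps))

# Learning rate at the start, midpoint, and end of training.
print(cosine_decay_lr(0))     # → 0.0005
print(cosine_decay_lr(4096))  # half of base_lr, 0.00025
print(cosine_decay_lr(8192))  # ~0.0
```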