A density estimation perspective on learning from pairwise human preferences
Authors: Vincent Dumoulin, Daniel D. Johnson, Pablo Samuel Castro, Hugo Larochelle, Yann Dauphin
TMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We provide theoretical and empirical results showing that for a family of generative processes defined via preference behavior distribution equations, training a reward function on pairwise preferences effectively models an annotator's implicit preference distribution. Finally, we discuss and present findings on annotator misspecification failure cases where wrong modeling assumptions are made about annotator behavior, resulting in poorly-adapted models, suggesting that approaches that learn from pairwise human preferences could have trouble learning from a population of annotators with diverse viewpoints. A notebook reproducing all experiments in this paper can be accessed at https://github.com/google-deepmind/pbde. |
| Researcher Affiliation | Collaboration | Google DeepMind; University of Toronto; Mila, Université de Montréal; CIFAR Fellow |
| Pseudocode | No | The paper describes algorithms and procedures using mathematical notation and prose, but it does not contain any clearly labeled pseudocode blocks or algorithms. |
| Open Source Code | Yes | A notebook reproducing all experiments in this paper can be accessed at https://github.com/google-deepmind/pbde. |
| Open Datasets | Yes | We now move to the One Billion Words Benchmark (LM1B; Chelba et al., 2013) to illustrate the consequences of annotator misspecification in a language modeling setting. |
| Dataset Splits | No | The paper describes data generation and sampling strategies for its experiments (e.g., 'drawing 2^15 observation pairs x A and x B uniformly at random'), and training on 'distinct subsets of LM1B according to sequence length'. However, it does not specify explicit training, validation, or test dataset splits in terms of percentages or sample counts for model evaluation in a standard machine learning context. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or other computing specifications used for running the experiments. |
| Software Dependencies | No | The paper mentions 'Flax's LM1B example code' and 'Optimizer Adam (Kingma & Ba, 2015)' but does not specify version numbers for any software libraries, frameworks, or programming languages used in the experiments. |
| Experiment Setup | Yes | Table 1: Univariate toy experiment hyperparameters. Architecture: MLP; Hidden layers: 4 layers of width 64; Activation: tanh; Optimizer: Adam (Kingma & Ba, 2015); Optimization steps: 8192; Learning rate: 5e-4; Learning rate schedule: cosine decay to 0.0 over 8192 steps |
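The "Research Type" evidence describes training a reward function on pairwise preferences so that it models the annotator's implicit preference distribution. As a hypothetical illustration only (this is the standard Bradley-Terry pairwise formulation commonly used in preference learning, not code from the paper's notebook; the function name and signature are assumptions):

```python
import math

def pairwise_preference_loss(r_a, r_b, a_preferred=True):
    """Negative log-likelihood of an annotator's choice between two
    observations x_A and x_B under the Bradley-Terry model,
    p(A preferred over B) = sigmoid(r_A - r_B),
    where r_a and r_b are scalar reward-model outputs."""
    p_a = 1.0 / (1.0 + math.exp(-(r_a - r_b)))
    return -math.log(p_a if a_preferred else 1.0 - p_a)

# Equal rewards imply a 50/50 preference, so the loss is ln 2.
print(round(pairwise_preference_loss(0.0, 0.0), 4))  # → 0.6931
```

Minimizing this loss over sampled preference pairs pushes the reward difference toward the annotator's log-odds of preferring one observation over the other, which is the sense in which the learned reward captures the preference distribution.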
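The toy-experiment hyperparameters above specify a learning rate of 5e-4 with cosine decay to 0.0 over 8192 steps. A minimal sketch of such a schedule, assuming the standard cosine-decay formula (the paper's notebook may implement it differently, e.g. via an optimizer library):

```python
import math

def cosine_decay_lr(step, base_lr=5e-4, total_steps=8192):
    """Cosine decay from base_lr at step 0 to 0.0 at total_steps."""
    step = min(step, total_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * step / total_steps))

# Learning rate at the start, midpoint, and end of training.
print(cosine_decay_lr(0))     # → 0.0005
print(cosine_decay_lr(4096))  # half of base_lr, 0.00025
print(cosine_decay_lr(8192))  # ~0.0
```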