Models of human preference for learning reward functions

Authors: W. Bradley Knox, Stephane Hatgis-Kessell, Serena Booth, Scott Niekum, Peter Stone, Alessandro G. Allievi

TMLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | For these two preference models, we first focus theoretically on a normative analysis (Section 3)... We follow up with a descriptive analysis of how well each of these proposed models aligns with actual human preferences by collecting a human-labeled dataset of preferences in a rich grid world domain (Section 4) and showing that the regret preference model better predicts these human preferences (Section 5). Finally, we find that the policies ultimately created through the regret preference model tend to outperform those learned with the partial return preference model, both when assessed with collected human preferences and when assessed with synthetic preferences (Section 6).
Researcher Affiliation | Collaboration | W. Bradley Knox (Bosch; The University of Texas at Austin; Google Research), Stephane Hatgis-Kessell (The University of Texas at Austin), Serena Booth (Bosch; MIT CSAIL), Scott Niekum (The University of Texas at Austin; University of Massachusetts Amherst), Peter Stone (The University of Texas at Austin; Sony AI), Alessandro Allievi (Bosch)
Pseudocode | Yes | Algorithm 1: Linear reward learning with the regret preference model (P_regret), using successor features
Open Source Code | Yes | Our code for learning and for re-running our main experiments can be found here, alongside our interface for training subjects and for preference elicitation.
Open Datasets | Yes | The human preferences dataset is available here (Knox et al., 2023).
Dataset Splits | Yes | We conduct 10-fold cross-validation to learn a reward scaling factor for each of P_regret and P_Σr... We randomly assign human preferences from our gathered dataset to different numbers of same-sized partitions, resulting in different training set sizes, and test each preference model on each partition.
Hardware Specification | Yes | The computer used to run the experiments shown in Figures 15, 16, 17, 18, 19, 21, 22, and 23 had the following specification. Processor: 2x AMD EPYC 7763 (64 cores, 2.45 GHz); Memory: 284 GB. The computer used to run all other experiments had the following specification. Processor: 1x Intel Core i9-9980XE (18 cores, 3.00 GHz); Motherboard: 1x ASUS WS X299 SAGE/10G; GPUs: 4x RTX 2080 Ti; Memory: 128 GB.
Software Dependencies | Yes | PyTorch 1.7.1 (Paszke et al., 2019) was used to implement all reward learning models, and statistical analyses were performed using Scikit-learn 0.23.2 (Pedregosa et al., 2011).
Experiment Setup | Yes | For all models, the learning rate, softmax temperature, and number of training iterations were tuned on the noiseless synthetic preference datasets such that each model achieved an accuracy of 100% on our specific delivery task, and then were tuned further on stochastic synthetic preferences on the same task. Reward learning with the partial return preference model: learning rate: 2; number of training epochs: 30,000; optimizer: Adam (with β1 = 0.9, β2 = 0.999, and eps = 1e-08). Reward learning with the regret preference model: learning rate: 0.5; number of training epochs: 5,000; optimizer: Adam (with β1 = 0.9, β2 = 0.999, and eps = 1e-08); and softmax temperature: 0.001.
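The two preference models quoted under Research Type can be illustrated with a minimal Bradley-Terry-style sketch. This is not the paper's implementation: the function names, the temperature parameter, and the sign conventions are assumptions. It only encodes the qualitative distinction that the partial-return model favors the segment with the higher summed reward, while the regret model favors the segment with lower regret (smaller deviation from optimal behavior).

```python
import math

def pref_prob_partial_return(ret1: float, ret2: float, temp: float = 1.0) -> float:
    """P(segment 1 preferred) under a partial-return model: a logistic
    function of the difference in summed rewards. (Sketch; names assumed.)"""
    return 1.0 / (1.0 + math.exp(-(ret1 - ret2) / temp))

def pref_prob_regret(regret1: float, regret2: float, temp: float = 1.0) -> float:
    """P(segment 1 preferred) under a regret model: the segment with
    LOWER regret is more likely to be preferred. (Sketch; names assumed.)"""
    return 1.0 / (1.0 + math.exp(-(regret2 - regret1) / temp))
```

Note how the regret model flips the sign convention: a segment can have low partial return yet low regret (e.g., when no better behavior was available), which is exactly where the two models disagree.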
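Algorithm 1's use of successor features rests on the identity that, under a linear reward r(s) = w · φ(s), a segment's return equals w dotted with the discounted sum of the segment's features. A minimal sketch of that identity (variable names are assumptions, not the paper's code):

```python
def segment_return(w, features, gamma=1.0):
    """Return of a segment under a linear reward r(s) = w · phi(s).

    w        -- reward weight vector
    features -- per-timestep feature vectors phi(s_t)
    The discounted feature sum psi plays the role of a successor-feature
    vector: the return is simply w · psi, so return (and regret) estimates
    stay linear in w during learning. (Sketch.)"""
    psi = [sum(gamma**t * f[i] for t, f in enumerate(features))
           for i in range(len(w))]
    return sum(wi * pi for wi, pi in zip(w, psi))
```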
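The procedure quoted under Dataset Splits (randomly assigning preferences to same-sized partitions for 10-fold cross-validation) can be sketched with a small partitioning helper; the function name and seeding are assumptions, not the authors' code:

```python
import random

def k_fold_partitions(items, k, seed=0):
    """Shuffle items and split them into k (near) same-sized partitions,
    as in the 10-fold cross-validation described in the paper. (Sketch.)"""
    items = list(items)
    random.Random(seed).shuffle(items)
    return [items[i::k] for i in range(k)]
```

Each partition can then serve once as a held-out test set while the remaining k-1 partitions are used to fit the reward scaling factor.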
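For reference, the hyperparameters reported under Experiment Setup can be collected into config dicts. The key names are assumptions for readability; the values are as reported in the paper:

```python
# Hyperparameters for the partial-return preference model (values from the paper;
# dict key names are assumed, not taken from the authors' code).
PARTIAL_RETURN_CFG = {
    "learning_rate": 2.0,
    "epochs": 30_000,
    "optimizer": "Adam",
    "betas": (0.9, 0.999),
    "eps": 1e-08,
}

# Hyperparameters for the regret preference model.
REGRET_CFG = {
    "learning_rate": 0.5,
    "epochs": 5_000,
    "optimizer": "Adam",
    "betas": (0.9, 0.999),
    "eps": 1e-08,
    "softmax_temperature": 0.001,
}
```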