Models of human preference for learning reward functions

Authors: W. Bradley Knox, Stephane Hatgis-Kessell, Serena Booth, Scott Niekum, Peter Stone, Alessandro G. Allievi

TMLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | For these two preference models, we first focus theoretically on a normative analysis (Section 3)... We follow up with a descriptive analysis of how well each of these proposed models aligns with actual human preferences by collecting a human-labeled dataset of preferences in a rich grid world domain (Section 4) and showing that the regret preference model better predicts these human preferences (Section 5). Finally, we find that the policies ultimately created through the regret preference model tend to outperform those learned with the partial return preference model, both when assessed with collected human preferences and when assessed with synthetic preferences (Section 6).
Researcher Affiliation | Collaboration | W. Bradley Knox (Bosch; The University of Texas at Austin; Google Research), Stephane Hatgis-Kessell (The University of Texas at Austin), Serena Booth (Bosch; MIT CSAIL), Scott Niekum (The University of Texas at Austin; University of Massachusetts Amherst), Peter Stone (The University of Texas at Austin; Sony AI), Alessandro Allievi (Bosch)
Pseudocode | Yes | Algorithm 1: Linear reward learning with the regret preference model (P_regret), using successor features
Open Source Code | Yes | Our code for learning and for re-running our main experiments can be found here, alongside our interface for training subjects and for preference elicitation.
Open Datasets | Yes | The human preferences dataset is available here (Knox et al., 2023).
Dataset Splits | Yes | We conduct 10-fold cross-validation to learn a reward scaling factor for each of P_regret and P_Σr... We randomly assign human preferences from our gathered dataset to different numbers of same-sized partitions, resulting in different training set sizes, and test each preference model on each partition.
Hardware Specification | Yes | The computer used to run the experiments shown in Figures 15, 16, 17, 18, 19, 21, 22, and 23 had the following specification. Processor: 2x AMD EPYC 7763 (64 cores, 2.45 GHz); Memory: 284 GB. The computer used to run all other experiments had the following specification. Processor: 1x Intel Core i9-9980XE (18 cores, 3.00 GHz); Motherboard: 1x ASUS WS X299 SAGE/10G; GPUs: 4x RTX 2080 Ti; Memory: 128 GB.
Software Dependencies | Yes | PyTorch 1.7.1 (Paszke et al., 2019) was used to implement all reward learning models, and statistical analyses were performed using Scikit-learn 0.23.2 (Pedregosa et al., 2011).
Experiment Setup | Yes | For all models, the learning rate, softmax temperature, and number of training iterations were tuned on the noiseless synthetic preference datasets such that each model achieved an accuracy of 100% on our specific delivery task, and then were tuned further on stochastic synthetic preferences on the same task. Reward learning with the partial return preference model: learning rate: 2; number of training epochs: 30,000; optimizer: Adam (with β1 = 0.9, β2 = 0.999, and eps = 1e-08). Reward learning with the regret preference model: learning rate: 0.5; number of training epochs: 5,000; optimizer: Adam (with β1 = 0.9, β2 = 0.999, and eps = 1e-08); and softmax temperature: 0.001.
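The two preference models quoted under Research Type can be illustrated with a minimal Bradley-Terry-style sketch. This is not the paper's implementation: the function names, the temperature parameter, and the sign conventions are assumptions. It only encodes the qualitative distinction that the partial-return model favors the segment with the higher summed reward, while the regret model favors the segment with lower regret (smaller deviation from optimal behavior).

```python
import math

def pref_prob_partial_return(ret1: float, ret2: float, temp: float = 1.0) -> float:
    """P(segment 1 preferred) under a partial-return model: a logistic
    function of the difference in summed rewards. (Sketch; names assumed.)"""
    return 1.0 / (1.0 + math.exp(-(ret1 - ret2) / temp))

def pref_prob_regret(regret1: float, regret2: float, temp: float = 1.0) -> float:
    """P(segment 1 preferred) under a regret model: the segment with
    LOWER regret is more likely to be preferred. (Sketch; names assumed.)"""
    return 1.0 / (1.0 + math.exp(-(regret2 - regret1) / temp))
```

Note how the regret model flips the sign convention: a segment can have low partial return yet low regret (e.g., when no better behavior was available), which is exactly where the two models disagree.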
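Algorithm 1's use of successor features rests on the identity that, under a linear reward r(s) = w · φ(s), a segment's return equals w dotted with the discounted sum of the segment's features. A minimal sketch of that identity (variable names are assumptions, not the paper's code):

```python
def segment_return(w, features, gamma=1.0):
    """Return of a segment under a linear reward r(s) = w · phi(s).

    w        -- reward weight vector
    features -- per-timestep feature vectors phi(s_t)
    The discounted feature sum psi plays the role of a successor-feature
    vector: the return is simply w · psi, so return (and regret) estimates
    stay linear in w during learning. (Sketch.)"""
    psi = [sum(gamma**t * f[i] for t, f in enumerate(features))
           for i in range(len(w))]
    return sum(wi * pi for wi, pi in zip(w, psi))
```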
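The procedure quoted under Dataset Splits (randomly assigning preferences to same-sized partitions for 10-fold cross-validation) can be sketched with a small partitioning helper; the function name and seeding are assumptions, not the authors' code:

```python
import random

def k_fold_partitions(items, k, seed=0):
    """Shuffle items and split them into k (near) same-sized partitions,
    as in the 10-fold cross-validation described in the paper. (Sketch.)"""
    items = list(items)
    random.Random(seed).shuffle(items)
    return [items[i::k] for i in range(k)]
```

Each partition can then serve once as a held-out test set while the remaining k-1 partitions are used to fit the reward scaling factor.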
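For reference, the hyperparameters reported under Experiment Setup can be collected into config dicts. The key names are assumptions for readability; the values are as reported in the paper:

```python
# Hyperparameters for the partial-return preference model (values from the paper;
# dict key names are assumed, not taken from the authors' code).
PARTIAL_RETURN_CFG = {
    "learning_rate": 2.0,
    "epochs": 30_000,
    "optimizer": "Adam",
    "betas": (0.9, 0.999),
    "eps": 1e-08,
}

# Hyperparameters for the regret preference model.
REGRET_CFG = {
    "learning_rate": 0.5,
    "epochs": 5_000,
    "optimizer": "Adam",
    "betas": (0.9, 0.999),
    "eps": 1e-08,
    "softmax_temperature": 0.001,
}
```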