Benchmarks and Algorithms for Offline Preference-Based Reward Learning

Authors: Daniel Shin, Anca Dragan, Daniel S. Brown

TMLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To test our approach, we first evaluate existing offline RL benchmarks for their suitability for offline reward learning. ... When evaluated on this curated set of domains, our empirical results suggest that combining offline RL with learned human preferences can enable an agent to learn to perform novel tasks that were not explicitly shown in the offline data.
Researcher Affiliation | Academia | Daniel Shin (Computer Science Department, Stanford University); Anca D. Dragan (EECS Department, University of California, Berkeley); Daniel S. Brown (School of Computing, University of Utah)
Pseudocode | Yes | Algorithm 1 (OPRL); a minimal sketch of the preference loss this style of algorithm optimizes appears after this table.
Open Source Code | Yes | Videos of learned behavior and code are available in the Supplement.
Open Datasets | Yes | We first evaluate a variety of popular offline RL benchmarks from D4RL (Fu et al., 2020) to determine which domains are most suited for evaluating offline reward learning.
Dataset Splits | No | The paper describes evaluations on existing D4RL datasets and on new datasets created by the authors, e.g., 'training with the ground-truth reward function on the full dataset of 1 million state transitions' and 'Our experimental setup is similar to Maze2D, except we start with 50 pairs of trajectories instead of 5 and we add 10 trajectories per round of active queries instead of 1 query per round.' However, it does not explicitly define training, validation, or test splits in terms of percentages, absolute sample counts, or citations to standard splits for evaluating policies or reward models.
Hardware Specification | Yes | All models are trained on an Azure Standard NC24 Promo instance, with 24 vCPUs, 224 GiB of RAM, and 4x K80 GPUs (2 physical cards).
Software Dependencies | No | The paper mentions neural networks and offline RL algorithms (e.g., AWR, CQL), which implies frameworks such as PyTorch or TensorFlow, but it does not specify any software components with version numbers (e.g., 'Python 3.8', 'PyTorch 1.9').
Experiment Setup | Yes | For our experimental setup, we first randomly select 5 pairs of trajectory snippets and train 5 epochs with our models. After this initial training process, for each round, one additional pair of trajectories is queried to be added to the training set and we train one more epoch on this augmented dataset. ... For policy learning with AWR, lower-dimensional environments including Maze2D-Umaze, Maze2D-Medium, and Hopper are run for 400 iterations. Higher-dimensional environments including Halfcheetah, Flow-Merge Random, and Kitchen-Complete are run for 1000 iterations. ... For CQL, the policy learning rate is 1e-4, the Lagrange threshold is -1.0, the min Q weight is 5.0, the min Q version is 3, and policy eval start is 0. (Sketches of this query schedule and the CQL configuration follow the table.)
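
This report does not reproduce Algorithm 1 itself. For orientation, below is a minimal sketch of the pairwise preference objective that offline preference-based reward learning of this kind typically optimizes: a Bradley-Terry model over predicted snippet returns, in the style of Christiano et al. (2017). The `RewardNet` architecture and hidden size are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class RewardNet(nn.Module):
    """Illustrative MLP reward model r_theta(s, a) -> scalar per step."""
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs, act):
        # obs: (T, obs_dim), act: (T, act_dim) -> (T,) per-step rewards
        return self.net(torch.cat([obs, act], dim=-1)).squeeze(-1)

def preference_loss(reward_net, traj_a, traj_b, label):
    """Bradley-Terry loss over one pair of trajectory snippets.

    traj_a, traj_b: (obs, act) tensor pairs for each snippet.
    label: 0 if snippet A is preferred, 1 if snippet B is preferred.
    """
    ret_a = reward_net(*traj_a).sum()  # predicted return of snippet A
    ret_b = reward_net(*traj_b).sum()  # predicted return of snippet B
    logits = torch.stack([ret_a, ret_b]).unsqueeze(0)  # shape (1, 2)
    return nn.functional.cross_entropy(logits, torch.tensor([label]))
```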
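The quoted query schedule (5 initial pairs trained for 5 epochs, then one new labeled pair and one additional epoch per round) can be sketched as the loop below, reusing `RewardNet` and `preference_loss` from above. The `synthetic_label` oracle (preferences derived from ground-truth returns) and the uniform-random pair selection are simplifying assumptions; the paper also studies active query-selection strategies, which this sketch does not implement.

```python
import random
import torch

def synthetic_label(traj_a, traj_b):
    """Prefer the snippet with higher ground-truth return; each trajectory
    is assumed here to be an (obs, act, rew) tensor tuple."""
    return 0 if traj_a[2].sum() >= traj_b[2].sum() else 1

def train_reward(reward_net, trajs, n_rounds=100, lr=1e-3):
    """Active preference-learning loop matching the quoted schedule."""
    opt = torch.optim.Adam(reward_net.parameters(), lr=lr)
    pairs = [tuple(random.sample(trajs, 2)) for _ in range(5)]  # 5 initial pairs
    labels = [synthetic_label(a, b) for a, b in pairs]

    def epoch():
        for (a, b), y in zip(pairs, labels):
            loss = preference_loss(reward_net, a[:2], b[:2], y)
            opt.zero_grad(); loss.backward(); opt.step()

    for _ in range(5):           # initial training: 5 epochs on the 5 pairs
        epoch()
    for _ in range(n_rounds):    # per round: add one labeled pair, train 1 epoch
        a, b = random.sample(trajs, 2)
        pairs.append((a, b)); labels.append(synthetic_label(a, b))
        epoch()
    return reward_net
```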
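For reference, the CQL hyperparameters quoted above are collected in the config dict below. The key names follow common public CQL implementations and are an assumption, not the authors' exact code; unspecified fields (batch size, Q-function learning rate, network sizes) are omitted rather than guessed.

```python
# CQL settings as quoted in the paper; key names are assumed, values are quoted.
cql_config = {
    "policy_lr": 1e-4,        # policy learning rate
    "lagrange_thresh": -1.0,  # Lagrange threshold
    "min_q_weight": 5.0,      # weight on the conservative min-Q term
    "min_q_version": 3,       # variant of the min-Q regularizer
    "policy_eval_start": 0,   # policy eval start (as quoted)
}
```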