Leveraging Sub-Optimal Data for Human-in-the-Loop Reinforcement Learning

Authors: Calarina Muslimani, Matthew E. Taylor

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through extensive experiments with both simulated and human teachers, we find that SDP can at least meet, and often significantly improve upon, state-of-the-art human-in-the-loop RL performance across a variety of simulated robotic tasks.
Researcher Affiliation | Academia | Calarina Muslimani, University of Alberta (EMAIL); Matthew E. Taylor, University of Alberta, Alberta Machine Intelligence Institute (EMAIL)
Pseudocode | Yes | Algorithm 1 SDP. Require: reward model r̂_θ with parameters θ randomly initialized; reward-model data set D_RM; RL agent with replay buffer D_agent; sub-optimal data set D_sub with reward labels r_min
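The Require line above can be made concrete with a small sketch of SDP's reward-model pre-training idea: every (state, action) transition from the sub-optimal data set is given the pseudo-label r_min (the minimum environment reward), and the reward model is fit by regression to those labels before any human feedback arrives. Everything here is an illustrative assumption — the linear model (the paper uses a neural network), the r_min value, the dimensions, and the function name — not the paper's implementation.

```python
import numpy as np

def pretrain_reward_model(transitions, r_min=-1.0, lr=0.1, epochs=500):
    """Sketch of SDP's reward-model pre-training phase.

    Each sub-optimal (state, action) transition is pseudo-labeled with
    the minimum reward r_min, and a linear reward model w @ [s; a; 1]
    (a stand-in for the paper's neural network) is fit by MSE
    regression with plain gradient descent.
    """
    # Features: concatenated state and action, plus a bias term.
    X = np.array([np.concatenate([s, a, [1.0]]) for s, a in transitions])
    y = np.full(len(transitions), r_min)      # pseudo-labels: r_min everywhere
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        grad = X.T @ (X @ w - y) / len(y)     # gradient of 0.5 * MSE
        w -= lr * grad
    return w

# Usage: 100 random-policy transitions with 3-D states and 1-D actions.
rng = np.random.default_rng(0)
transitions = [(rng.normal(size=3), rng.normal(size=1)) for _ in range(100)]
w = pretrain_reward_model(transitions)
preds = np.array([np.concatenate([s, a, [1.0]]) @ w for s, a in transitions])
# After pre-training, predicted rewards on sub-optimal data sit near r_min.
print(round(preds.mean(), 2))
```

The point of the phase is only that sub-optimal behavior starts out scored at the bottom of the reward range; the subsequent human-feedback loop then refines the model on informative comparisons.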
Open Source Code | Yes | To ensure the reproducibility of our work, we provide a link to our code repository: https://github.com/cmuslima/SDP_ICLR.
Open Datasets | Yes | Crucially, we further highlight the real-world applicability of SDP by demonstrating its success with human teachers in a 16-person user study. Overall, this work takes an important step toward considering how human-in-the-loop RL approaches can take advantage of readily available sub-optimal data.

5 EXPERIMENTS
This section considers the following four research questions (RQs): RQ 1: Can SDP improve upon existing scalar- and preference-based RL methods? RQ 2: Can SDP effectively leverage sub-optimal trajectories from different tasks to improve performance on a target task? RQ 3: Can SDP be used with real human feedback? RQ 4: How sensitive is SDP to various hyperparameters?

5.1 EXPERIMENTAL DESIGN
To demonstrate the versatility of SDP, we apply SDP to both preference- and scalar-based RL approaches. However, as preference feedback can be less time-consuming than scalar feedback, we primarily concentrate on preference-based RL in our experiments, exploring scalar feedback in a smaller capacity. For the preference-based experiments, we combine SDP with four contemporary preference-based algorithms: PEBBLE (Lee et al., 2021a), RUNE (Liang et al., 2022), SURF (Park et al., 2022), and MRN (Liu et al., 2022). We benchmark the performance of the four algorithms augmented with SDP against their original versions without SDP, as well as against SAC. We treat SAC (Haarnoja et al., 2018) as an oracle (i.e., an upper bound) because it learns with access to the ground-truth reward function, which is unavailable to the other algorithms. For the scalar-based experiments, we combine SDP with R-PEBBLE (a regression variant of PEBBLE). We compare SDP + R-PEBBLE against R-PEBBLE, Deep TAMER (Warnell et al., 2018) (a scalar-feedback RL algorithm), and SAC. We note that SAC is the core RL algorithm used across all baselines.

Implementation Details: For SDP, we collected sub-optimal trajectories via a random policy.
In particular, we used 50,000 (state, action) transitions for all experiments in Section 5.2. Note that we do not require explicit access to a sub-optimal policy; we only require (state, action) transitions from said policy. Moreover, to ensure a fair comparison across algorithms, we maintained equal feedback budgets for all algorithms within each environment, while adjusting the budget across environments to reflect their degree of difficulty. See Appendix A for a complete overview of the implementation process and specific hyperparameters for all algorithms.

Evaluation: We show average offline performance (i.e., the policy is frozen and evaluated with no exploration) over ten episodes using either the ground-truth reward function (DMControl experiments) or the success rate (Meta-World experiments). It is important to note that only SAC has access to the ground-truth reward function. We perform this evaluation every 10,000 training steps. To systematically evaluate performance, we use a simulated teacher that provides either a scalar rating of a single trajectory segment or a preference between two trajectory segments according to the ground-truth reward function. To thoroughly test the effectiveness of SDP, we perform evaluations on four robotic locomotion tasks from the DMControl Suite (Walker-walk, Cheetah-run, Quadruped-walk, and Cartpole-swingup) and five robotic manipulation tasks from Meta-World (Hammer, Door-unlock, Door-lock, Drawer-open, and Window-open). In our experiments, results are averaged over five seeds, with shaded regions or error bars indicating 95% confidence intervals. To test for significant differences in final performance (i.e., the undiscounted return) and learning efficiency (i.e., the total area under the return curve, AUC), we perform Welch's t-tests (equal variances not assumed) with a p-value threshold of 0.05. See Appendices D.8 and D.9, Tables 9-14 for a summary of final performance and AUC across all experiments.
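The significance test the excerpt describes (Welch's t-test, which does not assume equal variances) is simple to reproduce. The sketch below computes the Welch t statistic and the Welch–Satterthwaite degrees of freedom from two sets of per-seed scores; the sample data are made up for illustration, and in practice `scipy.stats.ttest_ind(a, b, equal_var=False)` returns the same statistic together with the p-value to compare against the 0.05 threshold.

```python
import math

def welch_t_test(a, b):
    """Welch's t statistic and degrees of freedom (equal variances not assumed)."""
    n1, n2 = len(a), len(b)
    m1, m2 = sum(a) / n1, sum(b) / n2
    v1 = sum((x - m1) ** 2 for x in a) / (n1 - 1)   # unbiased sample variance
    v2 = sum((x - m2) ** 2 for x in b) / (n2 - 1)
    se1, se2 = v1 / n1, v2 / n2                      # squared standard errors
    t = (m1 - m2) / math.sqrt(se1 + se2)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    dof = (se1 + se2) ** 2 / (se1 ** 2 / (n1 - 1) + se2 ** 2 / (n2 - 1))
    return t, dof

# Usage: hypothetical per-seed final returns for two algorithms (5 seeds each).
baseline = [1.0, 2.0, 3.0, 4.0, 5.0]
with_sdp = [2.0, 3.0, 4.0, 5.0, 6.0]
t, dof = welch_t_test(with_sdp, baseline)
print(t, dof)   # t = 1.0, dof = 8.0 for these symmetric samples
```

With only five seeds per condition, the degrees of freedom are small, which is exactly why the unequal-variance form matters when return variances differ across algorithms.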
Dataset Splits | Yes | In particular, we used 50,000 (state, action) transitions for all experiments in Section 5.2. ... We evaluate SDP and R-PEBBLE with feedback budgets [60, 100, 200] to analyze the impact of feedback quantity on performance. ... We evaluate the performance of SDP using varying amounts of sub-optimal transitions [5000, 15000, 50000].
Hardware Specification | No | The paper does not explicitly describe the specific hardware used to run its experiments (e.g., GPU models, CPU models, or detailed computer specifications).
Software Dependencies | No | The paper mentions software such as the Adam optimizer (Kingma & Ba, 2015) and refers to existing algorithms like PEBBLE and SAC, but it does not specify version numbers for these software components or for the underlying programming languages/libraries (e.g., Python, PyTorch, TensorFlow versions).
Experiment Setup | Yes | In all of our experiments, we use the hyperparameters in Table 1 for the reward models used in all benchmarks. For the agent update phase of SDP, an additional hyperparameter controls the number of environment interactions made before the standard preference/scalar feedback learning loop begins. ... Furthermore, we use most of the existing reward-model hyperparameters from PEBBLE; however, we adjusted the following four hyperparameters: feedback frequency, amount of feedback per session, trajectory segment size (only for Meta-World), and the activation function for the final NN layer. ... As for the SAC hyperparameters, we use the values found in Tables 3-4.
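As a reading aid, the four reward-model hyperparameters the excerpt says were adjusted relative to PEBBLE can be gathered into one config object. Only the key names come from the text; every value below is a made-up placeholder (the real settings are in the paper's Table 1 and Appendix A), and the validator is a hypothetical helper.

```python
# Hypothetical config sketch; values are placeholders, not the paper's Table 1.
reward_model_config = {
    "feedback_frequency": 5000,      # env steps between feedback sessions
    "feedback_per_session": 10,      # queries answered per session
    "segment_size": 50,              # trajectory segment length (Meta-World only)
    "final_activation": "tanh",      # activation of the reward net's last layer
}

def validate_config(cfg):
    """Check that every adjusted hyperparameter from the excerpt is present."""
    required = {"feedback_frequency", "feedback_per_session",
                "segment_size", "final_activation"}
    missing = required - cfg.keys()
    if missing:
        raise KeyError(f"missing hyperparameters: {sorted(missing)}")
    return True

print(validate_config(reward_model_config))   # True
```

Keeping these four knobs in one place makes it easy to see, per environment, exactly what was changed from the PEBBLE defaults when attempting a reproduction.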