A Ranking Game for Imitation Learning

Authors: Harshit Sikchi, Akanksha Saran, Wonjoon Goo, Scott Niekum

TMLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments show that the proposed method achieves state-of-the-art sample efficiency and can solve previously unsolvable tasks in the Learning from Observation (LfO) setting.
Researcher Affiliation | Collaboration | Harshit Sikchi (EMAIL), Department of Computer Science, The University of Texas at Austin; Akanksha Saran (EMAIL), Microsoft Research NYC; Wonjoon Goo (EMAIL), Department of Computer Science, The University of Texas at Austin; Scott Niekum (EMAIL), Department of Computer Science, University of Massachusetts Amherst
Pseudocode | Yes | Algorithm 1: Meta algorithm rank-game (vanilla) for imitation; Algorithm 2: Policy As Leader (PAL) practical instantiation; Algorithm 3: Reward As Leader (RAL) practical instantiation
Open Source Code | Yes | Project video and code can be found at this URL.
Open Datasets | Yes | We compare rank-game against state-of-the-art LfO and LfD approaches on MuJoCo benchmarks with continuous state and action spaces. Door opening environment from Robosuite (Zhu et al., 2020a) (licensed under MIT License) and the Pen-v0 environment from mjrl (Rajeswaran et al., 2017) (licensed under Apache License 2.0); expert data obtained via D4RL (licensed under CC BY).
Dataset Splits | No | The paper mentions using a single expert trajectory for some experiments and discusses how offline rankings are sampled, but it does not provide explicit training/test/validation splits for the primary datasets used in the experiments.
Hardware Specification | No | The paper does not explicitly describe the hardware used to run its experiments, only mentioning general environments such as 'simulated MuJoCo domains'.
Software Dependencies | No | The paper mentions using Soft Actor-Critic (SAC) (Haarnoja et al., 2018) and building upon SAC code (Achiam, 2018) but does not provide specific version numbers for these or other software libraries/frameworks.
Experiment Setup | Yes | For reward learning, we use an MLP parameterized by two hidden layers of 64 dimensions each. Furthermore, we clip the outputs of the reward network to the [-10, 10] range to keep the rewards bounded, while also adding an L2 regularization of 0.01. Hyperparameters for RANK-{PAL,RAL} (vanilla, auto, and pref) methods are shown in Table 5. For RANK-PAL, we found the following hyperparameters to give the best results: n_pol = H and n_rew = (val or H/b), where H is the environment horizon (usually set to 1000 for MuJoCo locomotion tasks) and b is the batch size used for the reward update. For RANK-RAL, we found n_pol = H and n_rew = (val or |D|/b), where |D| indicates the cumulative size of the ranking dataset. ... Table 5: Common hyperparameters for the RANK-GAME algorithms. Square brackets in the left column indicate which hyperparameters are specific to the auto and pref methods.
Policy updates (n_pol): H
Reward batch size (b): 1024
Reward gradient updates (n_rew): val or |D|/1024
Reward learning rate: 1e-3
Reward clamp range: [-10, 10]
Reward L2 weight decay: 0.0001
Number of interpolations [auto]: 5
Reward shaping parameterization [auto]: exp-[-1]
Offline rankings loss weight (λ) [pref]: 0.3
Snippet length l [pref]: 10
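The reward-network setup above (two 64-unit hidden layers, outputs clamped to [-10, 10], L2 penalty on the weights) can be sketched as follows. This is a minimal NumPy illustration, not the authors' code: the activation function, weight initialization, and function names (make_reward_net, reward, l2_penalty) are assumptions, since the paper specifies only the layer sizes, clamp range, and regularization coefficient.

```python
import numpy as np

def make_reward_net(obs_dim, hidden=64, seed=0):
    """Two hidden layers of 64 units each, per the reported setup.
    Gaussian init scaled by 1/sqrt(fan_in) is an assumption."""
    rng = np.random.default_rng(seed)
    dims = [obs_dim, hidden, hidden, 1]
    return [(rng.standard_normal((d_in, d_out)) / np.sqrt(d_in), np.zeros(d_out))
            for d_in, d_out in zip(dims[:-1], dims[1:])]

def reward(params, x, clamp=10.0):
    """Forward pass; output clipped to the [-10, 10] range as described.
    tanh hidden activation is an assumption."""
    h = x
    for W, b in params[:-1]:
        h = np.tanh(h @ W + b)
    W, b = params[-1]
    return np.clip(h @ W + b, -clamp, clamp)

def l2_penalty(params, coef=0.01):
    """L2 regularization on the reward-network weights
    (coefficient 0.01 from the text)."""
    return coef * sum((W ** 2).sum() for W, _ in params)
```

In practice the L2 term would be added to the ranking loss before each of the n_rew gradient updates on batches of size b = 1024.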