A Ranking Game for Imitation Learning
Authors: Harshit Sikchi, Akanksha Saran, Wonjoon Goo, Scott Niekum
TMLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments show that the proposed method achieves state-of-the-art sample efficiency and can solve previously unsolvable tasks in the Learning from Observation (LfO) setting. |
| Researcher Affiliation | Collaboration | Harshit Sikchi (EMAIL), Department of Computer Science, The University of Texas at Austin; Akanksha Saran (EMAIL), Microsoft Research NYC; Wonjoon Goo (EMAIL), Department of Computer Science, The University of Texas at Austin; Scott Niekum (EMAIL), Department of Computer Science, University of Massachusetts Amherst |
| Pseudocode | Yes | Algorithm 1: Meta algorithm: rank-game (vanilla) for imitation; Algorithm 2: Policy As Leader (PAL) practical instantiation; Algorithm 3: Reward As Leader (RAL) practical instantiation |
| Open Source Code | Yes | Project video and code can be found at this URL. |
| Open Datasets | Yes | We compare rank-game against state-of-the-art LfO and LfD approaches on MuJoCo benchmarks having continuous state and action spaces. Door-opening environment from Robosuite (Zhu et al., 2020a) (licensed under MIT License) and the Pen-v0 environment from mjrl (Rajeswaran et al., 2017) (licensed under Apache License 2.0); obtained via D4RL (licensed under CC BY) |
| Dataset Splits | No | The paper mentions using a single expert trajectory for some experiments and discusses how offline rankings are sampled, but it does not provide explicit training/test/validation splits for the primary datasets used in the experiments. |
| Hardware Specification | No | The paper does not explicitly describe the hardware used to run its experiments, only mentioning general environments like 'simulated MuJoCo domains'. |
| Software Dependencies | No | The paper mentions using Soft Actor-Critic (SAC) (Haarnoja et al., 2018) and building upon SAC code (Achiam, 2018) but does not provide specific version numbers for these or other software libraries/frameworks. |
| Experiment Setup | Yes | For reward learning, we use an MLP parameterized by two hidden layers of 64 dimensions each. Furthermore, we clip the outputs of the reward network to the [-10, 10] range to keep the rewards bounded, while also adding an L2 regularization of 0.01. Hyperparameters for RANK-{PAL,RAL} (vanilla, auto, and pref) methods are shown in Table 5. For RANK-PAL, we found the following hyperparameters to give the best results: npol = H and nrew = (validation or H/b), where H is the environment horizon (usually set to 1000 for MuJoCo locomotion tasks) and b is the batch size used for the reward update. For RANK-RAL, we found npol = H and nrew = (validation or |D|/b), where |D| indicates the cumulative size of the ranking dataset. ... Table 5: Common hyperparameters for the RANK-GAME algorithms (square brackets in the left column indicate hyperparameters specific to the auto and pref methods): Policy updates (npol): H; Reward batch size (b): 1024; Reward gradient updates (nrew): val or |D|/1024; Reward learning rate: 1e-3; Reward clamp range: [-10, 10]; Reward L2 weight decay: 0.0001; Number of interpolations [auto]: 5; Reward shaping parameterization [auto]: exp-[-1]; Offline rankings loss weight (λ) [pref]: 0.3; Snippet length (l) [pref]: 10 |
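The reward-network configuration quoted in the Experiment Setup row (an MLP with two 64-unit hidden layers, outputs clipped to [-10, 10], plus an L2 weight penalty) can be sketched in plain numpy. This is an illustrative sketch only, not the authors' code: the tanh hidden activation, the Gaussian initialization, and the function names `init_reward_mlp`, `reward`, and `l2_penalty` are assumptions not stated in the paper.

```python
import numpy as np

def init_reward_mlp(obs_dim, hidden=64, seed=0):
    """Initialize a 2-hidden-layer MLP reward network (64 units each, per the setup row)."""
    rng = np.random.default_rng(seed)
    dims = [obs_dim, hidden, hidden, 1]
    # Gaussian init is an assumption; the paper does not specify an initializer.
    return [(rng.normal(0.0, 0.1, (d_in, d_out)), np.zeros(d_out))
            for d_in, d_out in zip(dims[:-1], dims[1:])]

def reward(params, x, clamp=10.0):
    """Forward pass; output clipped to [-clamp, clamp] as described in the setup row."""
    h = x
    for W, b in params[:-1]:
        h = np.tanh(h @ W + b)  # tanh hidden activation is an assumption
    W, b = params[-1]
    return np.clip(h @ W + b, -clamp, clamp)

def l2_penalty(params, weight=1e-4):
    """L2 regularization term added to the reward loss (weight value from Table 5)."""
    return weight * sum((W ** 2).sum() for W, _ in params)
```

Clipping the reward output and penalizing large weights both keep the learned reward bounded, which the row cites as the motivation for these choices.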