A Ranking Game for Imitation Learning
Authors: Harshit Sikchi, Akanksha Saran, Wonjoon Goo, Scott Niekum
TMLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments show that the proposed method achieves state-of-the-art sample efficiency and can solve previously unsolvable tasks in the Learning from Observation (LfO) setting. |
| Researcher Affiliation | Collaboration | Harshit Sikchi (EMAIL), Department of Computer Science, The University of Texas at Austin; Akanksha Saran (EMAIL), Microsoft Research NYC; Wonjoon Goo (EMAIL), Department of Computer Science, The University of Texas at Austin; Scott Niekum (EMAIL), Department of Computer Science, University of Massachusetts Amherst |
| Pseudocode | Yes | Algorithm 1: Meta algorithm: rank-game (vanilla) for imitation; Algorithm 2: Policy As Leader (PAL) practical instantiation; Algorithm 3: Reward As Leader (RAL) practical instantiation |
| Open Source Code | Yes | Project video and code can be found at this URL. |
| Open Datasets | Yes | We compare rank-game against state-of-the-art LfO and LfD approaches on MuJoCo benchmarks having continuous state and action spaces. Door-opening environment from Robosuite (Zhu et al., 2020a) (licensed under MIT License) and the Pen-v0 environment from mjrl (Rajeswaran et al., 2017) (licensed under Apache License 2.0); obtained via D4RL (licensed under CC BY) |
| Dataset Splits | No | The paper mentions using a single expert trajectory for some experiments and discusses how offline rankings are sampled, but it does not provide explicit training/test/validation splits for the primary datasets used in the experiments. |
| Hardware Specification | No | The paper does not explicitly describe the hardware used to run its experiments, only mentioning general environments like 'simulated MuJoCo domains'. |
| Software Dependencies | No | The paper mentions using Soft Actor-Critic (SAC) (Haarnoja et al., 2018) and building upon SAC code (Achiam, 2018) but does not provide specific version numbers for these or other software libraries/frameworks. |
| Experiment Setup | Yes | For reward learning, we use an MLP parameterized by two hidden layers of 64 dimensions each. Furthermore, we clip the outputs of the reward network to the [-10, 10] range to keep the rewards bounded, while also adding an L2 regularization of 0.01. Hyperparameters for RANK-{PAL,RAL} (vanilla, auto, and pref) methods are shown in Table 5. For RANK-PAL, we found the following hyperparameters to give the best results: npol = H and nrew = (validation or H/b), where H is the environment horizon (usually set to 1000 for MuJoCo locomotion tasks) and b is the batch size used for the reward update. For RANK-RAL, we found npol = H and nrew = (validation or |D|/b), where |D| indicates the cumulative size of the ranking dataset. ... Table 5: Common hyperparameters for the RANK-GAME algorithms (square brackets in the left column indicate hyperparameters specific to the auto and pref methods): Policy updates (npol): H; Reward batch size (b): 1024; Reward gradient updates (nrew): val or |D|/1024; Reward learning rate: 1e-3; Reward clamp range: [-10, 10]; Reward L2 weight decay: 0.0001; Number of interpolations [auto]: 5; Reward shaping parameterization [auto]: exp-[-1]; Offline rankings loss weight (λ) [pref]: 0.3; Snippet length (l) [pref]: 10 |
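The reward-network configuration quoted in the Experiment Setup row (an MLP with two 64-unit hidden layers, outputs clipped to [-10, 10], plus an L2 weight penalty) can be sketched in plain numpy. This is an illustrative sketch only, not the authors' code: the tanh hidden activation, the Gaussian initialization, and the function names `init_reward_mlp`, `reward`, and `l2_penalty` are assumptions not stated in the paper.

```python
import numpy as np

def init_reward_mlp(obs_dim, hidden=64, seed=0):
    """Initialize a 2-hidden-layer MLP reward network (64 units each, per the setup row)."""
    rng = np.random.default_rng(seed)
    dims = [obs_dim, hidden, hidden, 1]
    # Gaussian init is an assumption; the paper does not specify an initializer.
    return [(rng.normal(0.0, 0.1, (d_in, d_out)), np.zeros(d_out))
            for d_in, d_out in zip(dims[:-1], dims[1:])]

def reward(params, x, clamp=10.0):
    """Forward pass; output clipped to [-clamp, clamp] as described in the setup row."""
    h = x
    for W, b in params[:-1]:
        h = np.tanh(h @ W + b)  # tanh hidden activation is an assumption
    W, b = params[-1]
    return np.clip(h @ W + b, -clamp, clamp)

def l2_penalty(params, weight=1e-4):
    """L2 regularization term added to the reward loss (weight value from Table 5)."""
    return weight * sum((W ** 2).sum() for W, _ in params)
```

Clipping the reward output and penalizing large weights both keep the learned reward bounded, which the row cites as the motivation for these choices.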