Adversarial Imitation Learning from Visual Observations using Latent Information

Authors: Vittorio Giammarino, James Queeney, Ioannis Paschalidis

TMLR 2024

Reproducibility Variable Result LLM Response
Research Type: Experimental. "In this section, we conduct experiments that aim to answer the following questions: (1) For the V-IfO problem, how does LAIfO compare to PatchAIL (Liu et al., 2023), a state-of-the-art approach for V-IfO, in terms of asymptotic performance and computational efficiency? ... The results are summarized in Table 2, Figure 1, and Figure 2. Table 2 includes the asymptotic performance of each algorithm, as well as the ratio of wall-clock times between LAIfO and PatchAIL to achieve 75% of expert performance. Figure 1 depicts the average return per episode throughout training as a function of wall-clock time. Moreover, we include in Figure 2 plots showing the average return per episode as a function of training steps. These results demonstrate that LAIfO can successfully solve the V-IfO problem, achieving asymptotic performance comparable to the state-of-the-art baseline PatchAIL."
Researcher Affiliation: Collaboration. Vittorio Giammarino (EMAIL), Division of Systems Engineering, Boston University; James Queeney (EMAIL), Mitsubishi Electric Research Laboratories; Ioannis Ch. Paschalidis (EMAIL), Department of Electrical and Computer Engineering, Division of Systems Engineering, Faculty of Computing & Data Sciences, Boston University.
Pseudocode: Yes. "We provide more implementation details and the complete pseudo-code for our algorithm in Appendix D."
Open Source Code: Yes. "To ensure reproducibility, we provide free access to all the learning curves and open-source our code. All of the expert policies can be downloaded by following the instructions in our code repository."
Open Datasets: Yes. "In order to address Question (1), we evaluate LAIfO and PatchAIL (Liu et al., 2023), in its weight-regularized version denoted by PatchAIL-W, on 13 different tasks from the DeepMind Control Suite (Tassa et al., 2018)."
Dataset Splits: No. "We use DDPG to train experts in a fully observable setting and collect 100 episodes of expert data. All the other algorithms are trained for 3×10^6 frames in walker run, hopper hop, cheetah run, quadruped run, and quadruped walk, and 10^6 frames for the other tasks. We evaluate the learned policies using average performance over 10 episodes. We run each experiment for 6 seeds." The paper specifies how expert data is collected, how evaluation is performed (10 episodes), and the training durations in frames, but it does not provide explicit training/validation/test splits for the data used to train the learning agent.
Hardware Specification: No. "For more details about the hardware used to carry out these experiments, all the learning curves, additional ablation studies, and other implementation details, refer to Appendix E and to our code." The main text defers hardware details to an appendix and the code repository, and does not specify the hardware itself.
Software Dependencies: No. "We consider the state-of-the-art model-free RL-from-pixels algorithm DrQ-v2 (Yarats et al., 2021) as a baseline." The paper names baseline algorithms but does not list software dependencies with specific version numbers (e.g., Python, PyTorch, or CUDA versions).
Experiment Setup: Yes. "Finally, from a theoretical standpoint, note that we should perform an importance sampling correction in order to account for the effect of off-policy data when sampling from B (Queeney et al., 2021; 2022). However, neglecting the off-policy correction works well in practice and does not compromise the stability of the algorithm (Kostrikov et al., 2018). We provide more implementation details and the complete pseudo-code for our algorithm in Appendix D. BC is trained offline using expert observation-action pairs for 10^4 gradient steps. All the other algorithms are trained for 3×10^6 frames in walker run, hopper hop, cheetah run, quadruped run, and quadruped walk, and 10^6 frames for the other tasks. We run each experiment for 6 seeds. The reward function r_χ(z, z′) is defined as in (8), and ψ1 and ψ2 are the slow-moving weights for the target Q networks."
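The wall-clock comparison reported in Table 2 of the paper (time to reach 75% of expert performance) can be illustrated with a minimal sketch. This is not the authors' code; `time_to_fraction` and `wallclock_ratio` are hypothetical helper names, and the learning curve is assumed to be a list of (wall-clock time, average return) pairs.

```python
def time_to_fraction(curve, expert_return, frac=0.75):
    """First wall-clock time at which the learning curve reaches
    `frac` of the expert's average return, or None if never reached.
    `curve` is a time-ordered list of (wall_clock_time, avg_return) pairs."""
    threshold = frac * expert_return
    for t, ret in curve:
        if ret >= threshold:
            return t
    return None

def wallclock_ratio(curve_a, curve_b, expert_return, frac=0.75):
    """Ratio of the two algorithms' times to reach the threshold
    (e.g., LAIfO vs. PatchAIL); None if either never reaches it."""
    ta = time_to_fraction(curve_a, expert_return, frac)
    tb = time_to_fraction(curve_b, expert_return, frac)
    if ta is None or tb is None:
        return None
    return ta / tb
```

A ratio below 1 would indicate that the first algorithm reached the 75% threshold in less wall-clock time than the second.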
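The evaluation protocol quoted above (average performance over 10 episodes, repeated for 6 seeds) can be sketched as follows. This is an illustrative aggregation, not the paper's evaluation code; `evaluate` is a hypothetical name.

```python
from statistics import mean, stdev

def evaluate(seed_episode_returns):
    """Aggregate evaluation returns: one inner list per seed, each holding
    per-episode returns (e.g., 6 seeds x 10 evaluation episodes).
    Returns (mean of per-seed average returns, std across seeds)."""
    per_seed = [mean(episodes) for episodes in seed_episode_returns]
    return mean(per_seed), stdev(per_seed)
```

Reporting the standard deviation across seeds, rather than across individual episodes, reflects run-to-run variability of the training procedure itself.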
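The "slow-moving weights for the target Q networks" mentioned in the setup are commonly maintained with a Polyak (exponential moving average) update. A minimal sketch under that assumption, with parameters represented as flat lists of floats for illustration (the paper's implementation operates on network weights):

```python
def soft_update(target_params, online_params, tau=0.01):
    """Slow-moving target-network update (Polyak averaging):
    theta_target <- (1 - tau) * theta_target + tau * theta_online."""
    return [(1 - tau) * t + tau * o
            for t, o in zip(target_params, online_params)]
```

A small `tau` keeps the target Q networks changing slowly, which stabilizes the bootstrapped targets used in off-policy Q-learning.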