Adversarial Imitation Learning from Visual Observations using Latent Information

Authors: Vittorio Giammarino, James Queeney, Ioannis Paschalidis

TMLR 2024

Reproducibility Variable Result LLM Response
Research Type: Experimental. "In this section, we conduct experiments that aim to answer the following questions: (1) For the V-IfO problem, how does LAIfO compare to PatchAIL (Liu et al., 2023), a state-of-the-art approach for V-IfO, in terms of asymptotic performance and computational efficiency? ... The results are summarized in Table 2, Figure 1, and Figure 2. Table 2 includes the asymptotic performance of each algorithm, as well as the ratio of wall-clock times between LAIfO and PatchAIL to achieve 75% of expert performance. Figure 1 depicts the average return per episode throughout training as a function of wall-clock time. Moreover, we include in Figure 2 plots showing the average return per episode as a function of training steps. These results demonstrate that LAIfO can successfully solve the V-IfO problem, achieving asymptotic performance comparable to the state-of-the-art baseline PatchAIL."
Researcher Affiliation: Collaboration. Vittorio Giammarino (EMAIL), Division of Systems Engineering, Boston University; James Queeney (EMAIL), Mitsubishi Electric Research Laboratories; Ioannis Ch. Paschalidis (EMAIL), Department of Electrical and Computer Engineering, Division of Systems Engineering, Faculty of Computing & Data Sciences, Boston University.
Pseudocode: Yes. "We provide more implementation details and the complete pseudo-code for our algorithm in Appendix D."
Open Source Code: Yes. "To ensure reproducibility, we provide free access to all the learning curves and open-source our code. All of the expert policies can be downloaded by following the instructions in our code repository."
Open Datasets: Yes. "In order to address Question (1), we evaluate LAIfO and PatchAIL (Liu et al., 2023), in its weight-regularized version denoted by PatchAIL-W, on 13 different tasks from the DeepMind Control Suite (Tassa et al., 2018)."
Dataset Splits: No. "We use DDPG to train experts in a fully observable setting and collect 100 episodes of expert data. All the other algorithms are trained for 3×10^6 frames in walker run, hopper hop, cheetah run, quadruped run, and quadruped walk, and 10^6 frames for the other tasks. We evaluate the learned policies using average performance over 10 episodes. We run each experiment for 6 seeds." The paper specifies how expert data is collected, how evaluation is performed (10 episodes), and the training durations in frames, but it does not provide explicit training/validation/test splits for the data used to train the learning agent.
Hardware Specification: No. "For more details about the hardware used to carry out these experiments, all the learning curves, additional ablation studies, and other implementation details, refer to Appendix E and to our code." The main text defers hardware details to an appendix and the code repository, and does not specify the hardware itself.
Software Dependencies: No. "We consider the state-of-the-art model-free RL-from-pixels algorithm DrQ-v2 (Yarats et al., 2021) as a baseline." The paper names baseline algorithms but does not list software dependencies with specific version numbers (e.g., Python, PyTorch, or CUDA versions).
Experiment Setup: Yes. "Finally, from a theoretical standpoint, note that we should perform an importance sampling correction in order to account for the effect of off-policy data when sampling from B (Queeney et al., 2021; 2022). However, neglecting the off-policy correction works well in practice and does not compromise the stability of the algorithm (Kostrikov et al., 2018). We provide more implementation details and the complete pseudo-code for our algorithm in Appendix D. BC is trained offline using expert observation-action pairs for 10^4 gradient steps. All the other algorithms are trained for 3×10^6 frames in walker run, hopper hop, cheetah run, quadruped run, and quadruped walk, and 10^6 frames for the other tasks. We run each experiment for 6 seeds. The reward function r_χ(z, z′) is defined as in (8), and ψ1 and ψ2 are the slow-moving weights for the target Q networks."
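The wall-clock comparison reported in Table 2 of the paper (time to reach 75% of expert performance) can be illustrated with a minimal sketch. This is not the authors' code; `time_to_fraction` and `wallclock_ratio` are hypothetical helper names, and the learning curve is assumed to be a list of (wall-clock time, average return) pairs.

```python
def time_to_fraction(curve, expert_return, frac=0.75):
    """First wall-clock time at which the learning curve reaches
    `frac` of the expert's average return, or None if never reached.
    `curve` is a time-ordered list of (wall_clock_time, avg_return) pairs."""
    threshold = frac * expert_return
    for t, ret in curve:
        if ret >= threshold:
            return t
    return None

def wallclock_ratio(curve_a, curve_b, expert_return, frac=0.75):
    """Ratio of the two algorithms' times to reach the threshold
    (e.g., LAIfO vs. PatchAIL); None if either never reaches it."""
    ta = time_to_fraction(curve_a, expert_return, frac)
    tb = time_to_fraction(curve_b, expert_return, frac)
    if ta is None or tb is None:
        return None
    return ta / tb
```

A ratio below 1 would indicate that the first algorithm reached the 75% threshold in less wall-clock time than the second.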
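The evaluation protocol quoted above (average performance over 10 episodes, repeated for 6 seeds) can be sketched as follows. This is an illustrative aggregation, not the paper's evaluation code; `evaluate` is a hypothetical name.

```python
from statistics import mean, stdev

def evaluate(seed_episode_returns):
    """Aggregate evaluation returns: one inner list per seed, each holding
    per-episode returns (e.g., 6 seeds x 10 evaluation episodes).
    Returns (mean of per-seed average returns, std across seeds)."""
    per_seed = [mean(episodes) for episodes in seed_episode_returns]
    return mean(per_seed), stdev(per_seed)
```

Reporting the standard deviation across seeds, rather than across individual episodes, reflects run-to-run variability of the training procedure itself.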
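The "slow-moving weights for the target Q networks" mentioned in the setup are commonly maintained with a Polyak (exponential moving average) update. A minimal sketch under that assumption, with parameters represented as flat lists of floats for illustration (the paper's implementation operates on network weights):

```python
def soft_update(target_params, online_params, tau=0.01):
    """Slow-moving target-network update (Polyak averaging):
    theta_target <- (1 - tau) * theta_target + tau * theta_online."""
    return [(1 - tau) * t + tau * o
            for t, o in zip(target_params, online_params)]
```

A small `tau` keeps the target Q networks changing slowly, which stabilizes the bootstrapped targets used in off-policy Q-learning.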