Deconfounding Imitation Learning with Variational Inference

Authors: Risto Vuorio, Pim De Haan, Johann Brehmer, Hanno Ackermann, Daniel Dijkman, Taco Cohen

TMLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To test our method in practice, we conduct experiments in the multi-armed bandit problem from Ortega et al. (2021) and in multiple control environments. We aim to answer three questions: 1) Is the effect of confounding on naive BC large enough to justify the use of specialized methods? 2) Is our algorithm capable of identifying the interventional policy? 3) How well does the interventional policy imitate the expert? ... Figure 4: Imitation learning in a multi-armed bandit problem. ... Figure 5: Experiments in our confounded, stochastic environments.
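The confounding effect the paper measures in its bandit experiments can be illustrated with a toy simulation. This is a hedged sketch, not the paper's environment: the arm count, dynamics, and all function names (`expert_action`, `sample_demo`, `naive_bc_policy`) are hypothetical. The expert conditions on a hidden context (the confounder) that the imitator never observes, so naive behavioral cloning recovers only the marginal action distribution:

```python
import random

def expert_action(context):
    # The expert sees the hidden context and picks the matching arm.
    return context

def sample_demo(n_arms, rng):
    context = rng.randrange(n_arms)   # confounder, hidden from the imitator
    action = expert_action(context)
    reward = 1.0 if action == context else 0.0
    return action, reward

def naive_bc_policy(demos, n_arms):
    # Naive BC fits the marginal action frequencies, ignoring the hidden
    # context -- here a near-uniform policy, even though the expert earns
    # reward 1 on every step.
    counts = [0] * n_arms
    for action, _ in demos:
        counts[action] += 1
    total = sum(counts)
    return [c / total for c in counts]

rng = random.Random(0)
demos = [sample_demo(3, rng) for _ in range(3000)]
policy = naive_bc_policy(demos, 3)
```

In this toy setup the cloned policy's expected reward is about 1/3 versus the expert's 1.0, which is the kind of gap that motivates the paper's first question about whether confounding justifies specialized methods.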
Researcher Affiliation | Collaboration | Risto Vuorio¹,², Pim de Haan²,³, Johann Brehmer², Hanno Ackermann², Daniel Dijkman², and Taco Cohen². ¹University of Oxford. ²Qualcomm AI Research (an initiative of Qualcomm Technologies, Inc.). ³QUVA Lab, University of Amsterdam.
Pseudocode | Yes | We show the pseudocode for the full training algorithm in Appendix B. ... We summarize the test-time behavior in pseudocode in Appendix B. (In Appendix B: Algorithm 1: Training deconfounded imitators, Algorithm 2: Deconfounded imitators at test time, Algorithm 3: Training deconfounded imitators, offline variant)
Open Source Code | No | We implemented GAIL closely following a popular publicly available implementation1 and using recurrent PPO by Raffin et al. (2021) as the RL algorithm. ... 1https://github.com/HumanCompatibleAI/imitation ... The experts are trained using a PPO implementation by Raffin et al. (2021) with hyperparameters from Raffin (2020). (The text refers to third-party code used, but not the authors' own code for their method.)
Open Datasets | Yes | For LunarLander-v2 (Brockman et al., 2016), we consider a modified version with unknown key bindings... For HalfCheetahBulletEnv-v0 (Coumans & Bai, 2016–2021), we modify the environment... In AntGoal-v0 (Todorov et al., 2012), we consider a version, where the task is to run to a goal...
Dataset Splits | No | we generate new training data from the expert for each update of the learning algorithms. ... In order to avoid finite-sample-size effects, we use an infinite-size training dataset by generating expert trajectories on the fly. (The paper describes generating data on the fly rather than using predefined dataset splits.)
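The quoted on-the-fly data regime can be sketched as a Python generator that yields a fresh batch of expert trajectories for every learner update, so no fixed train/validation split exists. This is a minimal illustration under assumed interfaces: `toy_policy`, `toy_step`, and the dynamics are placeholders, not the paper's experts or environments:

```python
import random

def toy_policy(state, rng):
    # Placeholder expert: random action in {0, 1}.
    return rng.randrange(2)

def toy_step(state, action, rng):
    # Placeholder dynamics and reward.
    reward = float(action == state % 2)
    return state + action, reward

def expert_trajectory(horizon, rng):
    # Roll out one fresh expert trajectory.
    state, traj = 0, []
    for _ in range(horizon):
        action = toy_policy(state, rng)
        state, reward = toy_step(state, action, rng)
        traj.append((state, action, reward))
    return traj

def fresh_batches(batch_size, horizon, seed=0):
    # A new batch of expert trajectories per update: the learner never
    # revisits data, approximating an infinite-size training dataset.
    rng = random.Random(seed)
    while True:
        yield [expert_trajectory(horizon, rng) for _ in range(batch_size)]

gen = fresh_batches(batch_size=4, horizon=5)
batch = next(gen)
```

Because every update sees previously unseen trajectories, held-out validation data is unnecessary for measuring generalization, which matches the report's conclusion that no predefined splits are used.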
Hardware Specification | No | The paper does not provide specific hardware details (like GPU/CPU models, processor types, or memory amounts) used for running its experiments.
Software Dependencies | Yes | All networks are optimized using the Adam optimizer (Kingma & Ba, 2015) with default settings from PyTorch (Paszke et al., 2019)... The networks are optimized with AdamW (Loshchilov & Hutter, 2017). The experts are trained using a PPO implementation by Raffin et al. (2021) with hyperparameters from Raffin (2020).
Experiment Setup | Yes | Table 1: Hyperparameters for the deconfounded behavioral cloning and naive behavioral cloning algorithms ... Table 2: Hyperparameters for the deconfounded BC, DAgger, and naive BC algorithms for LunarLander-v2, HalfCheetahBulletEnv-v0, and AntGoal-v0 environments. ... Table 3: Hyperparameter settings for GAIL.