Deconfounding Imitation Learning with Variational Inference

Authors: Risto Vuorio, Pim De Haan, Johann Brehmer, Hanno Ackermann, Daniel Dijkman, Taco Cohen

TMLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To test our method in practice, we conduct experiments in the multi-armed bandit problem from Ortega et al. (2021) and in multiple control environments. We aim to answer three questions: 1) Is the effect of confounding on naive BC large enough to justify the use of specialized methods? 2) Is our algorithm capable of identifying the interventional policy? 3) How well does the interventional policy imitate the expert? ... Figure 4: Imitation learning in a multi-armed bandit problem. ... Figure 5: Experiments in our confounded, stochastic environments.
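The confounding effect the paper measures in its bandit experiments can be illustrated with a toy simulation. This is a hedged sketch, not the paper's environment: the arm count, dynamics, and all function names (`expert_action`, `sample_demo`, `naive_bc_policy`) are hypothetical. The expert conditions on a hidden context (the confounder) that the imitator never observes, so naive behavioral cloning recovers only the marginal action distribution:

```python
import random

def expert_action(context):
    # The expert sees the hidden context and picks the matching arm.
    return context

def sample_demo(n_arms, rng):
    context = rng.randrange(n_arms)   # confounder, hidden from the imitator
    action = expert_action(context)
    reward = 1.0 if action == context else 0.0
    return action, reward

def naive_bc_policy(demos, n_arms):
    # Naive BC fits the marginal action frequencies, ignoring the hidden
    # context -- here a near-uniform policy, even though the expert earns
    # reward 1 on every step.
    counts = [0] * n_arms
    for action, _ in demos:
        counts[action] += 1
    total = sum(counts)
    return [c / total for c in counts]

rng = random.Random(0)
demos = [sample_demo(3, rng) for _ in range(3000)]
policy = naive_bc_policy(demos, 3)
```

In this toy setup the cloned policy's expected reward is about 1/3 versus the expert's 1.0, which is the kind of gap that motivates the paper's first question about whether confounding justifies specialized methods.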
Researcher Affiliation | Collaboration | Risto Vuorio¹,², Pim de Haan²,³, Johann Brehmer², Hanno Ackermann², Daniel Dijkman², and Taco Cohen². ¹University of Oxford. ²Qualcomm AI Research (an initiative of Qualcomm Technologies, Inc.). ³QUVA Lab, University of Amsterdam.
Pseudocode | Yes | We show the pseudocode for the full training algorithm in Appendix B. ... We summarize the test-time behavior in pseudocode in Appendix B. (In Appendix B: Algorithm 1: Training deconfounded imitators, Algorithm 2: Deconfounded imitators at test time, Algorithm 3: Training deconfounded imitators, offline variant)
Open Source Code | No | We implemented GAIL closely following a popular publicly available implementation1 and using recurrent PPO by Raffin et al. (2021) as the RL algorithm. ... 1https://github.com/HumanCompatibleAI/imitation ... The experts are trained using a PPO implementation by Raffin et al. (2021) with hyperparameters from Raffin (2020). (The text refers to third-party code used, but not the authors' own code for their method.)
Open Datasets | Yes | For LunarLander-v2 (Brockman et al., 2016), we consider a modified version with unknown key bindings... For HalfCheetahBulletEnv-v0 (Coumans & Bai, 2016–2021), we modify the environment... In AntGoal-v0 (Todorov et al., 2012), we consider a version, where the task is to run to a goal...
Dataset Splits | No | we generate new training data from the expert for each update of the learning algorithms. ... In order to avoid finite-sample-size effects, we use an infinite-size training dataset by generating expert trajectories on the fly. (The paper describes generating data on the fly rather than using predefined dataset splits.)
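The quoted on-the-fly data regime can be sketched as a Python generator that yields a fresh batch of expert trajectories for every learner update, so no fixed train/validation split exists. This is a minimal illustration under assumed interfaces: `toy_policy`, `toy_step`, and the dynamics are placeholders, not the paper's experts or environments:

```python
import random

def toy_policy(state, rng):
    # Placeholder expert: random action in {0, 1}.
    return rng.randrange(2)

def toy_step(state, action, rng):
    # Placeholder dynamics and reward.
    reward = float(action == state % 2)
    return state + action, reward

def expert_trajectory(horizon, rng):
    # Roll out one fresh expert trajectory.
    state, traj = 0, []
    for _ in range(horizon):
        action = toy_policy(state, rng)
        state, reward = toy_step(state, action, rng)
        traj.append((state, action, reward))
    return traj

def fresh_batches(batch_size, horizon, seed=0):
    # A new batch of expert trajectories per update: the learner never
    # revisits data, approximating an infinite-size training dataset.
    rng = random.Random(seed)
    while True:
        yield [expert_trajectory(horizon, rng) for _ in range(batch_size)]

gen = fresh_batches(batch_size=4, horizon=5)
batch = next(gen)
```

Because every update sees previously unseen trajectories, held-out validation data is unnecessary for measuring generalization, which matches the report's conclusion that no predefined splits are used.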
Hardware Specification | No | The paper does not provide specific hardware details (like GPU/CPU models, processor types, or memory amounts) used for running its experiments.
Software Dependencies | Yes | All networks are optimized using the Adam optimizer (Kingma & Ba, 2015) with default settings from PyTorch (Paszke et al., 2019)... The networks are optimized with AdamW (Loshchilov & Hutter, 2017). The experts are trained using a PPO implementation by Raffin et al. (2021) with hyperparameters from Raffin (2020).
Experiment Setup | Yes | Table 1: Hyperparameters for the deconfounded behavioral cloning and naive behavioral cloning algorithms ... Table 2: Hyperparameters for the deconfounded BC, DAgger, and naive BC algorithms for LunarLander-v2, HalfCheetahBulletEnv-v0, and AntGoal-v0 environments. ... Table 3: Hyperparameter settings for GAIL.