On Instrumental Variable Regression for Deep Offline Policy Evaluation

Authors: Yutian Chen, Liyuan Xu, Caglar Gulcehre, Tom Le Paine, Arthur Gretton, Nando de Freitas, Arnaud Doucet

JMLR 2022

Reproducibility assessment (each entry gives the variable, the result, and the LLM response):
Research Type: Experimental. "We evaluate the performance of these techniques empirically on a variety of tasks and environments, including Behaviour Suite (BSuite) (Osband et al., 2019) and DeepMind Control Suite (DM Control) (Tassa et al., 2020). We found experimentally that some of the recent IV techniques such as AGMM display performance on par with state-of-the-art FQE methods."
Researcher Affiliation: Collaboration. Yutian Chen (DeepMind, R7, 14-18 Handyside Street, King's Cross, London N1C 4DN); Liyuan Xu (Gatsby Unit); Caglar Gulcehre (DeepMind); Tom Le Paine (DeepMind); Arthur Gretton (Gatsby Unit); Nando de Freitas (DeepMind); Arnaud Doucet (DeepMind).
Pseudocode: No. The paper describes various algorithms and methods (e.g., LSTD, Deep IV, KIV, DFIV, GMM, AGMM, ASEM) through mathematical formulations and prose, but it does not include any explicitly labeled pseudocode or algorithm blocks with structured steps.
Open Source Code: Yes. "We open-source all our code and datasets at https://github.com/liyuan9988/IVOPE with ACME."
Open Datasets: Yes. "We open-source all our code and datasets at https://github.com/liyuan9988/IVOPE with ACME." "We consider a list of reinforcement learning environments from two widely used task collections: Behaviour Suite (BSuite) (Osband et al., 2019) and DeepMind Control Suite (DM Control) (Tassa et al., 2020)."
Dataset Splits: Yes. "The dataset is then split randomly into training and validation subsets with a ratio of 9:1." Table 1 (BSuite tasks) and Table 2 (DM Control Suite tasks) both state: "The training and validation data ratio is 9:1."
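The random 9:1 train/validation split described above can be sketched in a few lines of plain Python. This is an illustrative reconstruction, not the authors' code; the function name and transition representation are assumptions.

```python
import random

def split_dataset(transitions, train_ratio=0.9, seed=0):
    """Randomly split a list of transitions into training and validation
    subsets (9:1 by default), matching the ratio reported in the paper."""
    rng = random.Random(seed)  # fixed seed for a reproducible split
    indices = list(range(len(transitions)))
    rng.shuffle(indices)
    cut = int(train_ratio * len(indices))
    train = [transitions[i] for i in indices[:cut]]
    valid = [transitions[i] for i in indices[cut:]]
    return train, valid

data = list(range(1000))          # placeholder transitions
train, valid = split_dataset(data)
print(len(train), len(valid))     # 900 100
```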
Hardware Specification: No. The paper does not provide specific hardware details (such as CPU or GPU models, or cloud computing instance types) used for running the experiments; it focuses on the software architecture and hyperparameters.
Software Dependencies: No. The paper mentions the ACME library (Hoffman et al., 2020) and the OAdam and Adam optimizers, but it does not specify version numbers for these or any other software dependencies, such as programming languages or deep learning frameworks.
Experiment Setup: Yes. "We compare a list of representative non-linear IV methods, including Kernel IV (KIV), Deep IV, Deep Feature IV (DFIV) and three adversarial IV methods: Deep GMM, Adversarial GMM Networks (AGMM), Adversarial approach to structural equation models (ASEM). We also include as baselines the deterministic Bellman residual minimization (DBRM) and two variants of the fitted Q evaluation methods with a deterministic (FQE) and distributional (DFQE) Q representation respectively." "All algorithms except KIV use the same network architecture to estimate the Q function as in the trained agent for a fair comparison. For BSuite tasks, the Q network is an MLP with layer size 50-50-1 and ReLU activation. The input is a concatenation of the flattened observation and one-hot encoding of the discrete action variable. For DM Control tasks, it is an MLP with layer size 512-512-256-1, ELU activation and a layer normalization after the first hidden layer." "We use OAdam for adversarial methods as suggested by Bennett et al. (2019b); Dikkala et al. (2020) and Adam for other methods." "We run a thorough hyper-parameter search for every algorithm in every environment. We randomly sample up to 100 hyper-parameter settings for every algorithm and choose the setting with the best metric on a held-out validation dataset." See Appendix B for hyper-parameter selection.
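The two Q-network architectures quoted above (a 50-50-1 ReLU MLP for BSuite and a 512-512-256-1 ELU MLP with layer normalization for DM Control) can be sketched with a minimal NumPy forward pass. This is a shape-level illustration only: the weights are random placeholders, the paper's actual implementation builds on the ACME library, and the exact ordering of the layer normalization relative to the activation is an assumption.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def elu(x):
    return np.where(x > 0, x, np.exp(x) - 1.0)

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def mlp_forward(x, sizes, activation, ln_after_first=False, seed=0):
    """Forward pass of an MLP with the given layer sizes.
    The final layer is linear (it outputs the scalar Q value)."""
    rng = np.random.default_rng(seed)
    h = x
    for i, out_dim in enumerate(sizes):
        W = rng.standard_normal((h.shape[-1], out_dim)) / np.sqrt(h.shape[-1])
        h = h @ W
        if i < len(sizes) - 1:  # no activation on the final linear output
            h = activation(h)
            if ln_after_first and i == 0:  # assumed placement of the layer norm
                h = layer_norm(h)
    return h

# BSuite Q network: flattened observation concatenated with a one-hot action,
# fed through a 50-50-1 MLP with ReLU.
obs, num_actions, action = np.ones(6), 3, 1
x = np.concatenate([obs, np.eye(num_actions)[action]])
q_bsuite = mlp_forward(x, [50, 50, 1], relu)

# DM Control Q network: 512-512-256-1 MLP with ELU and a layer norm
# after the first hidden layer (input dimension here is illustrative).
q_dmc = mlp_forward(np.ones(24), [512, 512, 256, 1], elu, ln_after_first=True)
print(q_bsuite.shape, q_dmc.shape)  # (1,) (1,)
```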