Using Confounded Data in Latent Model-Based Reinforcement Learning

Authors: Maxime Gasse, Damien Grasset, Guillaume Gaudron, Pierre-Yves Oudeyer

TMLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate that our method is correct and efficient, in the sense that it attains better generalization guarantees thanks to the confounded offline data (in the asymptotic case), regardless of the confounding effect (the offline expert's behaviour). We showcase our method on a series of synthetic experiments, which demonstrate that a) using confounded offline data naively degrades the sample-efficiency of an RL agent collecting and learning from online data; b) using confounded offline data correctly improves its sample-efficiency.
Researcher Affiliation | Collaboration | Maxime Gasse (EMAIL), ServiceNow Research, Montréal QC, Canada; Damien Grasset (EMAIL), IRT Saint Exupéry Canada, Montréal QC, Canada; Guillaume Gaudron (EMAIL), Ubisoft La Forge, Bordeaux, France; Pierre-Yves Oudeyer (EMAIL), Inria Bordeaux Sud-Ouest, Bordeaux, France
Pseudocode | Yes | Algorithm 1: Augmented model-based RL pseudocode
Open Source Code | Yes | The code to reproduce these experiments is available online (footnote 8): https://github.com/gasse/causal-rl-tmlr
Open Datasets | No | We run experiments on the three synthetic toy problems described in figure 6, each expressing a different level of complexity and a different form of privileged information. In tiger, the learning agent receives a noisy signal of the tiger's position (roar left or roar right). ... In hidden treasures the agent must collect a treasure (+1 reward), which is randomly located in one of the four corners. ... In sloppy dark room the agent must reach a treasure (+1 reward) located behind a wall, and slips to a random adjacent tile instead of moving in the chosen direction 50% of the time.
Dataset Splits | No | We evaluate on the real test environment 1) the quality of the causal transition model $\hat{q}(o_{t+1} \mid o_{0:t}, \mathrm{do}(a_{0:t}))$ in terms of its likelihood on new interventional data (collected via random exploration), and 2) the performance of the resulting policy $\hat{\pi}_{\mathrm{std}}$ in terms of its cumulated reward. We evaluate each model and agent over 10K new trajectories, and we repeat each experiment 10 times with different random seeds to account for variability. ... When learning the model, we divide the learning rate by 10 after 10 epochs without loss improvement (reduce on plateau), and we stop training after 20 epochs without improvement (early stopping). We use all available data for training, and we monitor the training loss for early stopping (no validation set).
Hardware Specification | No | The paper does not provide specific hardware details for running the experiments.
Software Dependencies | No | We train $\hat{q}$ via gradient descent using the Adam optimizer [17]... Both the actor and critic consist of a 2-layer perceptron (MLP) with the same hidden layer size... The paper mentions the Adam optimizer [17], but no other software with specific version numbers is provided.
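The quoted architecture (actor and critic as 2-layer MLPs sharing a hidden-layer size) can be sketched as follows. This is a minimal NumPy illustration, not the authors' code: the dimensions, hidden size, weight scale, and tanh activation are all assumptions, and "2-layer" is read here as one hidden layer plus an output layer.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_mlp(in_dim, hidden, out_dim):
    """Weights for a 2-layer perceptron (one hidden layer, one output layer)."""
    return {
        "W1": rng.standard_normal((in_dim, hidden)) * 0.1,
        "b1": np.zeros(hidden),
        "W2": rng.standard_normal((hidden, out_dim)) * 0.1,
        "b2": np.zeros(out_dim),
    }

def mlp_forward(params, x):
    h = np.tanh(x @ params["W1"] + params["b1"])   # hidden layer (assumed tanh)
    return h @ params["W2"] + params["b2"]         # linear output

obs_dim, n_actions, hidden = 8, 4, 64       # illustrative sizes, not from the paper
actor = make_mlp(obs_dim, hidden, n_actions)   # outputs action logits
critic = make_mlp(obs_dim, hidden, 1)          # outputs a scalar state value
```

Both networks are built with the same `hidden` argument, mirroring the quoted "same hidden layer size"; in the paper these would be trained with Adam.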
Experiment Setup | Yes | Table 1: Training hyperparameters we used in each experiment. When learning the model, we divide the learning rate by 10 after 10 epochs without loss improvement (reduce on plateau), and we stop training after 20 epochs without improvement (early stopping). We use all available data for training, and we monitor the training loss for early stopping (no validation set).
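The quoted schedule (divide the learning rate by 10 after 10 epochs without training-loss improvement, stop after 20, no validation set) can be sketched framework-agnostically. `step_epoch` is a hypothetical callback standing in for one epoch of Adam updates that returns the training loss; the default values follow the quoted description.

```python
def train_with_plateau_schedule(step_epoch, max_epochs=1000, lr=1e-3,
                                patience_lr=10, patience_stop=20, factor=0.1):
    """Reduce-on-plateau and early stopping driven by the training loss,
    as described in the quoted setup (no validation set is used)."""
    best = float("inf")
    since_best = 0          # epochs since the training loss last improved
    history = []
    for epoch in range(max_epochs):
        loss = step_epoch(lr)              # one epoch of updates at rate `lr`
        history.append((epoch, lr, loss))
        if loss < best:
            best, since_best = loss, 0
        else:
            since_best += 1
            if since_best >= patience_stop:   # early stopping after 20 stale epochs
                break
            if since_best == patience_lr:     # divide lr by 10 after 10 stale epochs
                lr *= factor
    return history
```

With a loss that improves for five epochs and then plateaus, the schedule cuts the learning rate once at the tenth stale epoch and halts training at the twentieth.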