Using Confounded Data in Latent Model-Based Reinforcement Learning
Authors: Maxime Gasse, Damien Grasset, Guillaume Gaudron, Pierre-Yves Oudeyer
TMLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate that our method is correct and efficient, in the sense that it attains better generalization guarantees thanks to the confounded offline data (in the asymptotic case), regardless of the confounding effect (the offline expert's behaviour). We showcase our method on a series of synthetic experiments, which demonstrate that a) using confounded offline data naively degrades the sample-efficiency of an RL agent collecting and learning from online data; b) using confounded offline data correctly improves its sample-efficiency. |
| Researcher Affiliation | Collaboration | Maxime Gasse, ServiceNow Research, Montréal QC, Canada; Damien Grasset, IRT Saint Exupéry Canada, Montréal QC, Canada; Guillaume Gaudron, Ubisoft La Forge, Bordeaux, France; Pierre-Yves Oudeyer, Inria Bordeaux Sud-Ouest, Bordeaux, France |
| Pseudocode | Yes | Algorithm 1 Augmented model-based RL pseudocode |
| Open Source Code | Yes | The code to reproduce these experiments is available online8. 8Code for the experiments: https://github.com/gasse/causal-rl-tmlr |
| Open Datasets | No | We run experiments on the three synthetic toy problems described in figure 6, each expressing a different level of complexity and a different form of privileged information. In tiger, the learning agent receives a noisy signal of the tiger’s position (roar left or roar right). ... In hidden treasures the agent must collect a treasure (+1 reward), which is randomly located in one of the four corners. ... In sloppy dark room the agent must reach a treasure (+1 reward) located behind a wall, and slips to a random adjacent tile instead of moving to the chosen direction 50% of the time. |
| Dataset Splits | No | We evaluate on the real test environment 1) the quality of the causal transition model $\hat{q}(o_{t+1} \mid o_{0:t}, \mathrm{do}(a_{0:t}))$ in terms of its likelihood on new interventional data (collected via random exploration), and 2) the performance of the resulting policy $\hat{\pi}_{\mathrm{std}}$ in terms of its cumulated reward. We evaluate each model and agent over 10K new trajectories, and we repeat each experiment 10 times with different random seeds to account for variability. ... When learning the model, we divide the learning rate by 10 after 10 epochs without loss improvement (reduce on plateau), and we stop training after 20 epochs without improvement (early stopping). We use all available data for training, and we monitor the training loss for early stopping (no validation set). |
| Hardware Specification | No | The paper does not provide specific hardware details for running the experiments. |
| Software Dependencies | No | We train $\hat{q}$ via gradient descent using the Adam optimizer [17]... Both the actor and the critic consist of a 2-layer multilayer perceptron (MLP) with the same hidden layer size... The paper mentions the Adam optimizer [17], but no other software with specific version numbers is provided. |
| Experiment Setup | Yes | Table 1: Training hyperparameters we used in each experiment. When learning the model, we divide the learning rate by 10 after 10 epochs without loss improvement (reduce on plateau), and we stop training after 20 epochs without improvement (early stopping). We use all available data for training, and we monitor the training loss for early stopping (no validation set). |
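The training schedule quoted above (divide the learning rate by 10 after 10 epochs without loss improvement, stop after 20 epochs without improvement, monitoring the training loss only) can be sketched in plain Python. This is a minimal illustration of that schedule, not the authors' code; the function name and defaults are assumptions.

```python
# Sketch of the schedule from Table 1: learning rate is divided by 10 after
# 10 epochs without training-loss improvement (reduce on plateau), and
# training stops after 20 epochs without improvement (early stopping).
# Names and defaults here are illustrative, not from the paper's code.

def run_schedule(losses, lr=1e-3, plateau_patience=10, stop_patience=20, factor=0.1):
    """Replay a sequence of per-epoch training losses; return the final
    learning rate and the epoch index at which training stops."""
    best = float("inf")
    since_improve = 0   # epochs since last improvement (for early stopping)
    since_reduce = 0    # epochs since last improvement or LR reduction
    for epoch, loss in enumerate(losses):
        if loss < best:
            best = loss
            since_improve = 0
            since_reduce = 0
        else:
            since_improve += 1
            since_reduce += 1
        if since_improve >= stop_patience:    # early stopping
            return lr, epoch
        if since_reduce >= plateau_patience:  # reduce on plateau
            lr *= factor
            since_reduce = 0
    return lr, len(losses) - 1
```

With a loss that never improves after the first epoch, the learning rate is reduced once at epoch 10 and training stops at epoch 20, matching the two patience thresholds. In a PyTorch setup this behaviour corresponds roughly to `torch.optim.lr_scheduler.ReduceLROnPlateau` combined with a hand-rolled early-stopping counter.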
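The "sloppy dark room" environment described in the Open Datasets row (the agent slips to a random adjacent tile instead of its chosen direction 50% of the time) can be sketched as a one-step transition function. Grid dimensions, wall handling, and all names here are assumptions for illustration; the paper's actual environment code is in the linked repository.

```python
import random

# Illustrative transition dynamics for a "sloppy" gridworld: with probability
# slip_prob the chosen action is replaced by a uniformly random direction.
# Grid size and boundary behaviour are assumptions, not from the paper.

MOVES = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}

def sloppy_step(pos, action, width=5, height=5, slip_prob=0.5, rng=random):
    """Apply one transition: the chosen move with probability 1 - slip_prob,
    otherwise a random adjacent move. Moves off the grid leave the agent
    in place."""
    if rng.random() < slip_prob:
        action = rng.choice(sorted(MOVES))
    dx, dy = MOVES[action]
    x, y = pos[0] + dx, pos[1] + dy
    if 0 <= x < width and 0 <= y < height:
        return (x, y)
    return pos
```

Setting `slip_prob=0` recovers deterministic gridworld dynamics, which makes the stochastic variant easy to unit-test against.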