Using Confounded Data in Latent Model-Based Reinforcement Learning

Authors: Maxime Gasse, Damien Grasset, Guillaume Gaudron, Pierre-Yves Oudeyer

TMLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate that our method is correct and efficient, in the sense that it attains better generalization guarantees thanks to the confounded offline data (in the asymptotic case), regardless of the confounding effect (the offline expert's behaviour). We showcase our method on a series of synthetic experiments, which demonstrate that a) using confounded offline data naively degrades the sample-efficiency of an RL agent collecting and learning from online data; b) using confounded offline data correctly improves its sample-efficiency.
Researcher Affiliation | Collaboration | Maxime Gasse (EMAIL), ServiceNow Research, Montréal QC, Canada; Damien Grasset (EMAIL), IRT Saint Exupéry Canada, Montréal QC, Canada; Guillaume Gaudron (EMAIL), Ubisoft La Forge, Bordeaux, France; Pierre-Yves Oudeyer (EMAIL), Inria Bordeaux Sud-Ouest, Bordeaux, France
Pseudocode | Yes | Algorithm 1: Augmented model-based RL pseudocode
Open Source Code | Yes | The code to reproduce these experiments is available online (footnote 8): https://github.com/gasse/causal-rl-tmlr
Open Datasets | No | We run experiments on the three synthetic toy problems described in figure 6, each expressing a different level of complexity and a different form of privileged information. In tiger, the learning agent receives a noisy signal of the tiger's position (roar left or roar right). ... In hidden treasures the agent must collect a treasure (+1 reward), which is randomly located in one of the four corners. ... In sloppy dark room the agent must reach a treasure (+1 reward) located behind a wall, and slips to a random adjacent tile instead of moving in the chosen direction 50% of the time.
Dataset Splits | No | We evaluate on the real test environment 1) the quality of the causal transition model $\hat{q}(o_{t+1} \mid o_{0:t}, \mathrm{do}(a_{0:t}))$ in terms of its likelihood on new interventional data (collected via random exploration), and 2) the performance of the resulting policy $\hat{\pi}_{\mathrm{std}}$ in terms of its cumulated reward. We evaluate each model and agent over 10K new trajectories, and we repeat each experiment 10 times with different random seeds to account for variability. ... When learning the model, we divide the learning rate by 10 after 10 epochs without loss improvement (reduce on plateau), and we stop training after 20 epochs without improvement (early stopping). We use all available data for training, and we monitor the training loss for early stopping (no validation set).
Hardware Specification | No | The paper does not provide specific hardware details for running the experiments.
Software Dependencies | No | We train $\hat{q}$ via gradient descent using the Adam optimizer [17]... Both the actor and critic consist of a 2-layer perceptron (MLP) with the same hidden layer size... The paper mentions the Adam optimizer [17], but no other software with specific version numbers is provided.
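The quoted architecture (actor and critic as 2-layer MLPs sharing a hidden-layer size) can be sketched as follows. This is a minimal NumPy illustration, not the authors' code: the dimensions, hidden size, weight scale, and tanh activation are all assumptions, and "2-layer" is read here as one hidden layer plus an output layer.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_mlp(in_dim, hidden, out_dim):
    """Weights for a 2-layer perceptron (one hidden layer, one output layer)."""
    return {
        "W1": rng.standard_normal((in_dim, hidden)) * 0.1,
        "b1": np.zeros(hidden),
        "W2": rng.standard_normal((hidden, out_dim)) * 0.1,
        "b2": np.zeros(out_dim),
    }

def mlp_forward(params, x):
    h = np.tanh(x @ params["W1"] + params["b1"])   # hidden layer (assumed tanh)
    return h @ params["W2"] + params["b2"]         # linear output

obs_dim, n_actions, hidden = 8, 4, 64       # illustrative sizes, not from the paper
actor = make_mlp(obs_dim, hidden, n_actions)   # outputs action logits
critic = make_mlp(obs_dim, hidden, 1)          # outputs a scalar state value
```

Both networks are built with the same `hidden` argument, mirroring the quoted "same hidden layer size"; in the paper these would be trained with Adam.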
Experiment Setup | Yes | Table 1: Training hyperparameters we used in each experiment. When learning the model, we divide the learning rate by 10 after 10 epochs without loss improvement (reduce on plateau), and we stop training after 20 epochs without improvement (early stopping). We use all available data for training, and we monitor the training loss for early stopping (no validation set).
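The quoted schedule (divide the learning rate by 10 after 10 epochs without training-loss improvement, stop after 20, no validation set) can be sketched framework-agnostically. `step_epoch` is a hypothetical callback standing in for one epoch of Adam updates that returns the training loss; the default values follow the quoted description.

```python
def train_with_plateau_schedule(step_epoch, max_epochs=1000, lr=1e-3,
                                patience_lr=10, patience_stop=20, factor=0.1):
    """Reduce-on-plateau and early stopping driven by the training loss,
    as described in the quoted setup (no validation set is used)."""
    best = float("inf")
    since_best = 0          # epochs since the training loss last improved
    history = []
    for epoch in range(max_epochs):
        loss = step_epoch(lr)              # one epoch of updates at rate `lr`
        history.append((epoch, lr, loss))
        if loss < best:
            best, since_best = loss, 0
        else:
            since_best += 1
            if since_best >= patience_stop:   # early stopping after 20 stale epochs
                break
            if since_best == patience_lr:     # divide lr by 10 after 10 stale epochs
                lr *= factor
    return history
```

With a loss that improves for five epochs and then plateaus, the schedule cuts the learning rate once at the tenth stale epoch and halts training at the twentieth.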