Robust Reinforcement Learning with Bayesian Optimisation and Quadrature

Authors: Supratik Paul, Konstantinos Chatzilygeroudis, Kamil Ciosek, Jean-Baptiste Mouret, Michael A. Osborne, Shimon Whiteson

JMLR 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Experimental results across different domains show that our algorithms learn robust policies efficiently."
Researcher Affiliation | Academia | 1 Department of Computer Science, University of Oxford, Wolfson Building, Parks Road, Oxford OX1 3QD; 2 Inria, Université de Lorraine, CNRS; 3 Department of Engineering Science, University of Oxford
Pseudocode | Yes | Algorithm 1: ALOQ; Algorithm 2: TALOQ
Open Source Code | No | The paper links the source code of a third-party tool the authors used ('We used the robot dart wrapper: https://github.com/resibots/robot_dart.'), but does not provide the authors' own implementation of ALOQ/TALOQ. A provided link leads to a video, not source code.
Open Datasets | No | The paper uses synthetic test functions and describes experimental setups for a robotic arm and a hexapod. While it references prior work and states that an archive of policies was created using MAP-Elites (Mouret and Clune, 2015), it provides no access information (link, DOI, or repository) for the datasets generated or used in the experiments.
Dataset Splits | No | The paper reports evaluations over a number of simulator calls or replicates (e.g., '20 independent runs', '20 random replicates'), but does not describe training, validation, or test splits in terms of percentages, sample counts, or predefined partitions.
Hardware Specification | No | The paper describes experiments on simulated and physical robotic systems, but does not specify the hardware used to run the simulations (e.g., CPU or GPU models, memory) or the exact specifications of the physical robots beyond general descriptions such as 'robotic arm' or 'hexapod'.
Software Dependencies | No | The paper mentions the DART physics simulator and the robot_dart wrapper, but does not give version numbers for these or for any other key libraries.
Experiment Setup | Yes | "We assume a log-normal hyperprior distribution for all the above hyperparameters. For the variance we use (µ = 0, σ = 1), while for the lengthscales we use (µ = 0, σ = 0.75) across all experiments. For the beta warping parameters we used (µ = 2, σ = 0.5) for all artificial test functions, and (µ = 0, σ = 0.5) for the robotic simulator tasks. We used DIRECT (Jones et al., 1993) to maximise the BO acquisition function α_ALOQ. We set κ = 1.5 for all three experiments in this section. For TRPO we used a neural net with two hidden layers with 5 units each, and a KL constraint of 0.01. For REINFORCE we used a linear Gaussian policy and set the learning rate to 10^-4."
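To make the quoted hyperprior settings concrete, here is a minimal NumPy sketch (not the authors' code) of the log-normal hyperpriors described above: a parameter θ with hyperprior LogNormal(µ, σ) satisfies log θ ~ Normal(µ, σ). The input dimensionality and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_lognormal(mu, sigma, size=None):
    """Draw from LogNormal(mu, sigma): exponentiate a Normal(mu, sigma) draw."""
    return np.exp(rng.normal(mu, sigma, size=size))

# Hyperpriors as quoted in the paper:
signal_variance = sample_lognormal(0.0, 1.0)           # variance: (mu = 0, sigma = 1)
lengthscales = sample_lognormal(0.0, 0.75, size=2)     # lengthscales: (mu = 0, sigma = 0.75)
beta_warp = sample_lognormal(2.0, 0.5, size=2)         # beta warping (test functions): (mu = 2, sigma = 0.5)

def lognormal_logpdf(theta, mu, sigma):
    """Log-density of LogNormal(mu, sigma), useful when weighting hyperparameter samples."""
    z = (np.log(theta) - mu) / sigma
    return -0.5 * z**2 - np.log(theta * sigma * np.sqrt(2.0 * np.pi))
```

Because all draws pass through an exponential, every sampled hyperparameter is strictly positive, which is the usual reason for placing log-normal hyperpriors on GP variances and lengthscales.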