Safe Model-based Reinforcement Learning with Stability Guarantees

Authors: Felix Berkenkamp, Matteo Turchetta, Angela Schoellig, Andreas Krause

NeurIPS 2017

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In our experiments, we show how the resulting algorithm can safely optimize a neural network policy on a simulated inverted pendulum, without the pendulum ever falling down.
Researcher Affiliation | Academia | Felix Berkenkamp, Department of Computer Science, ETH Zurich, EMAIL; Matteo Turchetta, Department of Computer Science, ETH Zurich, EMAIL; Angela P. Schoellig, Institute for Aerospace Studies, University of Toronto, EMAIL; Andreas Krause, Department of Computer Science, ETH Zurich, EMAIL
Pseudocode | Yes | Algorithm 1 (SAFELYAPUNOVLEARNING)
Open Source Code | Yes | A Python implementation of Algorithm 1 and the experiments, based on TensorFlow [37] and GPflow [38], is available at https://github.com/befelix/safe_learning.
Open Datasets | No | The paper describes using a 'simulated inverted pendulum benchmark problem' and its dynamics, but does not provide a link, DOI, or formal citation for a publicly available or open dataset.
Dataset Splits | No | The paper describes a simulated environment and does not specify training, validation, or test dataset splits (e.g., percentages or sample counts).
Hardware Specification | No | The paper states that experiments were run on a 'simulated inverted pendulum' and mentions using 'TensorFlow', but does not provide any specific hardware details such as CPU/GPU models, memory, or cloud instance types.
Software Dependencies | No | The paper mentions 'TensorFlow [37] and GPflow [38]' but does not provide specific version numbers for these software dependencies.
Experiment Setup | Yes | For the policy, we use a neural network with two hidden layers of 32 neurons with ReLU activations each. We compute a conservative estimate of the Lipschitz constant as in [30]. We use standard approximate dynamic programming with a quadratic, normalized cost r(x, u) = x^T Q x + u^T R u, where Q and R are positive-definite, to compute the cost-to-go J_πθ. Specifically, we use a piecewise-linear triangulation of the state space to approximate J_πθ, see [39]. We optimize the policy via stochastic gradient descent on (7), where we sample a finite subset of X and replace the integral in (7) with a sum. We verify our approach on an inverted pendulum benchmark problem. The true, continuous-time dynamics are given by m l² ψ̈ = g m l sin(ψ) − λ ψ̇ + u, where ψ is the angle, m the mass, g the gravitational constant, λ the friction coefficient, and u the torque applied to the pendulum. We use a GP model for the discrete-time dynamics, where the mean dynamics are given by a linearized and discretized model of the true dynamics that assumes a wrong, lower mass and neglects friction. We use a combination of linear and Matérn kernels in order to capture the model errors that result from parameter and integration errors. To enable more data-efficient learning, we fix β_n = 2.
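As a minimal sketch of the benchmark described in the experiment setup, the continuous-time pendulum dynamics m l² ψ̈ = g m l sin(ψ) − λ ψ̇ + u and the quadratic cost r(x, u) = x^T Q x + u^T R u can be written in Python with NumPy. The parameter values, integration step, and Q/R weights below are hypothetical, chosen only for illustration; this excerpt of the paper does not state them.

```python
import numpy as np

# Hypothetical parameters for illustration only (not from the paper):
m, l, g, lam = 0.15, 0.5, 9.81, 0.05  # mass, length, gravity, friction coefficient
dt = 0.01                             # Euler integration step (assumed)

def pendulum_step(x, u):
    """One Euler step of m*l^2 * psi_ddot = g*m*l*sin(psi) - lam*psi_dot + u.

    x = (psi, psi_dot): angle and angular velocity; u: applied torque.
    psi = 0 is the upright equilibrium, so gravity is destabilizing.
    """
    psi, psi_dot = x
    psi_ddot = (g * m * l * np.sin(psi) - lam * psi_dot + u) / (m * l**2)
    return np.array([psi + dt * psi_dot, psi_dot + dt * psi_ddot])

def cost(x, u, Q=np.diag([1.0, 0.1]), R=np.array([[0.01]])):
    """Quadratic cost r(x, u) = x^T Q x + u^T R u with positive-definite Q, R."""
    u = np.atleast_1d(u)
    return float(x @ Q @ x + u @ R @ u)
```

The GP model in the paper would then be trained on the mismatch between such a true simulator and a linearized mean model with a deliberately wrong, lower mass.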