Safe Model-based Reinforcement Learning with Stability Guarantees
Authors: Felix Berkenkamp, Matteo Turchetta, Angela Schoellig, Andreas Krause
NeurIPS 2017
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In our experiments, we show how the resulting algorithm can safely optimize a neural network policy on a simulated inverted pendulum, without the pendulum ever falling down. |
| Researcher Affiliation | Academia | Felix Berkenkamp, Department of Computer Science, ETH Zurich; Matteo Turchetta, Department of Computer Science, ETH Zurich; Angela P. Schoellig, Institute for Aerospace Studies, University of Toronto; Andreas Krause, Department of Computer Science, ETH Zurich |
| Pseudocode | Yes | Algorithm 1 SafeLyapunovLearning |
| Open Source Code | Yes | A Python implementation of Algorithm 1 and the experiments based on TensorFlow [37] and GPflow [38] is available at https://github.com/befelix/safe_learning. |
| Open Datasets | No | The paper describes using a 'simulated inverted pendulum benchmark problem' and its dynamics, but does not provide a link, DOI, or formal citation for a publicly available or open dataset. |
| Dataset Splits | No | The paper describes a simulated environment and does not specify training, validation, or test dataset splits (e.g., percentages or sample counts). |
| Hardware Specification | No | The paper states that experiments were run on a 'simulated inverted pendulum' and mentions using 'TensorFlow' but does not provide any specific hardware details such as CPU/GPU models, memory, or cloud instance types. |
| Software Dependencies | No | The paper mentions 'TensorFlow [37] and GPflow [38]' but does not provide specific version numbers for these software dependencies. |
| Experiment Setup | Yes | For the policy, we use a neural network with two hidden layers and 32 neurons with ReLU activations each. We compute a conservative estimate of the Lipschitz constant as in [30]. We use standard approximate dynamic programming with a quadratic, normalized cost $r(x, u) = x^\top Q x + u^\top R u$, where $Q$ and $R$ are positive-definite, to compute the cost-to-go $J_{\pi_\theta}$. Specifically, we use a piecewise-linear triangulation of the state space to approximate $J_{\pi_\theta}$, see [39]. We optimize the policy via stochastic gradient descent on (7), where we sample a finite subset of $\mathcal{X}$ and replace the integral in (7) with a sum. We verify our approach on an inverted pendulum benchmark problem. The true, continuous-time dynamics are given by $m l^2 \ddot{\psi} = g m l \sin(\psi) - \lambda \dot{\psi} + u$, where $\psi$ is the angle, $m$ the mass, $g$ the gravitational constant, and $u$ the torque applied to the pendulum. We use a GP model for the discrete-time dynamics, where the mean dynamics are given by a linearized and discretized model of the true dynamics that considers a wrong, lower mass and neglects friction. We use a combination of linear and Matérn kernels in order to capture the model errors that result from parameter and integration errors. To enable more data-efficient learning, we fix $\beta_n = 2$. |
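
The dynamics and cost quoted in the Experiment Setup row are compact enough to sketch. The Python snippet below is a minimal illustration of the pendulum model $m l^2 \ddot{\psi} = g m l \sin(\psi) - \lambda \dot{\psi} + u$, a one-step Euler discretization, and the quadratic cost $r(x, u) = x^\top Q x + u^\top R u$. The physical constants, step size, and weight matrices are placeholder assumptions; the paper does not state the exact values it used.

```python
import numpy as np

# Pendulum parameters: mass, length, gravity, friction coefficient.
# These are illustrative values, not the ones used in the paper.
m, l, g, lam = 0.15, 0.5, 9.81, 0.1
dt = 0.01  # discretization step (assumed)

def pendulum_dynamics(state, u):
    """Continuous-time dynamics: m*l^2 * psi_ddot = g*m*l*sin(psi) - lam*psi_dot + u."""
    psi, psi_dot = state
    psi_ddot = (g * m * l * np.sin(psi) - lam * psi_dot + u) / (m * l ** 2)
    return np.array([psi_dot, psi_ddot])

def euler_step(state, u, dt=dt):
    """One Euler step of the discretized dynamics (integration errors from
    this discretization are part of what the GP model must capture)."""
    return state + dt * pendulum_dynamics(state, u)

# Quadratic, normalized cost r(x, u) = x^T Q x + u^T R u with
# positive-definite Q and R (weights below are assumed, not from the paper).
Q = np.diag([1.0, 0.1])
R = np.array([[0.01]])

def cost(x, u):
    x, u = np.atleast_1d(x), np.atleast_1d(u)
    return float(x @ Q @ x + u @ R @ u)

# Usage: simulate one step from a small initial angle with zero torque.
state = np.array([0.1, 0.0])
print(euler_step(state, u=0.0), cost(state, 0.0))
```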
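Similarly, the policy and GP dynamics model from the quoted setup can be sketched with TensorFlow and GPflow, the libraries named in the paper's code release. Only the network architecture (two hidden layers of 32 ReLU units each) and the linear-plus-Matérn kernel family come from the paper; the kernel combination (a sum rather than a product), the Matérn smoothness (3/2), the GPflow 2.x API, and the toy data below are assumptions for illustration.

```python
import numpy as np
import tensorflow as tf
import gpflow

# Policy: two hidden layers with 32 ReLU units each, as described in the
# paper. Input: pendulum state (angle, angular velocity); output: torque.
policy = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu", input_shape=(2,)),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1),
])

# Combination of linear and Matérn kernels to capture model errors from
# wrong parameters and discretization. A sum and Matern 3/2 are assumed;
# the paper does not specify how the kernels are combined.
kernel = gpflow.kernels.Linear() + gpflow.kernels.Matern32()

# Toy training data (placeholder values, for illustration only):
# inputs [psi, psi_dot, u], targets = error of the prior mean model.
X = np.random.randn(20, 3)
Y = np.random.randn(20, 1)
gp = gpflow.models.GPR((X, Y), kernel=kernel)
```

In the paper's setup, the GP's mean function is a linearized, discretized model with a deliberately wrong (lower) mass and no friction, so the GP only has to learn the residual between that prior model and the true dynamics.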