Adaptive Smoothing for Path Integral Control
Authors: Dominik Thalmeier, Hilbert J. Kappen, Simone Totaro, Vicenç Gómez
JMLR 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct numerical experiments on different control tasks, which show this accelerative effect in practice. For this we develop an algorithm called ASPIC (Adaptive Smoothing for Path Integral Control) that uses cost smoothing to speed up policy optimization. The algorithm adjusts the smoothing parameter in each step to keep the variance of the gradient estimator at a predefined level. To ensure robust updates of the policy, ASPIC enforces a trust region constraint; similar to Schulman et al. (2015) this is achieved with natural gradient updates and an adaptive stepsize. Like other policy gradient based methods (Williams, 1992; Peters and Schaal, 2008; Schulman et al., 2015; Heess et al., 2017) ASPIC is model-free. Many policy optimization algorithms update the control policy based on a direct optimization of the cost; examples are Trust Region Policy Optimization (TRPO) (Schulman et al., 2015) or Path-Integral Relative Entropy Policy Search (PIREPS) (Gómez et al., 2014), where the latter is particularly developed for path integral control problems. The main novelty of this work is the application to path integral control problems of the idea of smoothing, as introduced in Chaudhari et al. (2018). This technique outperforms direct cost optimization, achieving faster convergence rates with only a negligible amount of computational overhead. 6. Numerical Experiments We now analyze empirically the convergence speed of policy optimization with and without smoothing and show that smoothing accelerates convergence. For the optimization with smoothing, we use ASPIC (Algorithm 1) and for the optimization without smoothing, we use a version of ASPIC where we replaced the gradient of the smoothed cost with the gradient of the cost itself. We first consider a simple linear-quadratic (LQ) control problem and then focus on non-linear control tasks, for which we analyze the dependence of ASPIC on the hyper-parameters. We also compare ASPIC to other related RL algorithms. Further details about the numerical experiments are found in Appendix L. |
| Researcher Affiliation | Academia | Dominik Thalmeier EMAIL Radboud University Nijmegen Nijmegen, The Netherlands Hilbert J. Kappen EMAIL Radboud University Nijmegen Nijmegen, The Netherlands Simone Totaro EMAIL Universitat Pompeu Fabra Barcelona, Spain Vicenç Gómez EMAIL Universitat Pompeu Fabra Barcelona, Spain |
| Pseudocode | Yes | Algorithm 1 ASPIC (Adaptive Smoothing for Path Integral Control). Require: state cost function V(x, t); control cost parameter γ; base policy π₀ that defines the uncontrolled dynamics; a real system or simulator to compute dynamics using a parametrized policy πθ; trust region size E; smoothing strength; number of samples per iteration N. Initialize θ₀, n = 0. Repeat: draw state trajectories τⁱ, i = 1, …, N, using the parametrized policy πθₙ; for each sample i compute the path cost Sᵞ(τⁱ) = Σ₀<t<T [V(xⁱₜ, t) + γ log(πθₙ(aⁱₜ ∣ t, xⁱₜ) / π₀(aⁱₜ ∣ t, xⁱₜ))]. {Find the minimal α such that KL ≤ smoothing strength:} α ← 0; repeat: increase α; Sⁱ_α ← Sᵞ(τⁱ) / (γ + α); compute weights wⁱ ← exp(−Sⁱ_α); normalize weights wⁱ ← wⁱ / Σᵢ wⁱ; compute the sample-size-independent weight entropy KL ← log N + Σᵢ wⁱ log wⁱ; until KL ≤ smoothing strength. {Whiten the weights:} ŵⁱ ← (wⁱ − mean(w)) / std(w). {Compute the gradient of the smoothed cost:} g ← Σᵢ Σₜ ŵⁱ ∇θ log πθ(aⁱₜ ∣ t, xⁱₜ), evaluated at θ = θₙ. {Compute the Fisher matrix F:} use conjugate gradient to approximate the natural gradient g_F = F⁻¹g (Appendix J); do a line search to compute the step size η such that KL(θₙ ‖ θₙ₊₁) = E; update parameters θₙ₊₁ ← θₙ + η·g_F; n ← n + 1; until convergence. |
| Open Source Code | No | For reproducibility, the code will be made available upon acceptance of the final manuscript |
| Open Datasets | Yes | The latter was simulated using the Open AI gym (Brockman et al., 2016). For pendulum swing-up and the Acrobot tasks we used time-varying linear feedback controllers, whereas for the 2D Walker task we parametrized the control uθ using a neural network. ... We evaluate the performance of these algorithms on a set of six tasks from Pybullet, an open source real-time physics engine (see Appendix L.5 for more details). |
| Dataset Splits | No | The paper uses environments from OpenAI Gym and Pybullet, which are simulation environments for reinforcement learning where data is generated through interaction, rather than static datasets with predefined train/test/validation splits. Therefore, the concept of explicit dataset splits in the traditional sense does not directly apply or is not detailed in the paper for reproducibility. The paper describes the control tasks and initial conditions for the simulations. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU, GPU models, memory) used for running the experiments. It only mentions using general platforms like 'Open AI gym' and 'Pybullet open source engine' for simulations. |
| Software Dependencies | Yes | Table 1: Hyperparameters for the experiments using Pybullet. ... Python 3.8 |
| Experiment Setup | Yes | 6. Numerical Experiments ... Further details about the numerical experiments are found in Appendix L. ... L.1 Linear-Quadratic Control Task ... Batch size: N = 100, trust region E = 0.1, smoothing strength = 0.2 log 100, conjugate gradient iterations: 2 ... L.2 Pendulum Task ... batch size: N = 500, trust region E = 0.1, smoothing strength = 0.5 ... L.3 Acrobot Task ... batch size N = 500, trust region E = 0.1, and smoothing strength = 0.5 ... L.4 Walker Task ... batch size N = 100, trust region E = 0.01, smoothing strength = 0.05 log 100, and 10 conjugate gradient iterations. ... Table 1: Hyperparameters for the experiments using Pybullet: Number of rollouts (N): 50; Total number of rollouts: 50 000; Smoothing strength: {0.1, 0.5}; Trust region size (E): {0.025, 0.075}; Mini-batch size: 256; Units per layer: 32; Number of hidden layers: 1; Learning rate: 7e-4; Activation function: tanh; Action distribution: isotropic Gaussian |
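The inner loop of the quoted pseudocode (rescale path costs by 1/(γ+α), raise α until the sample-size-independent weight entropy falls to the target level, whiten the weights, and form the weighted gradient estimate) can be sketched in NumPy. This is a minimal illustrative sketch, not the authors' implementation: the function names, the fixed `alpha_step` search schedule, and the numerical-stabilization constants are assumptions, and the natural-gradient, trust-region, and line-search steps of ASPIC are omitted.

```python
import numpy as np

def smoothed_weights(path_costs, gamma, target_kl, alpha_step=0.1, max_alpha=1e6):
    """Find a minimal alpha such that the sample-size-independent weight
    entropy KL = log N + sum_i w_i log w_i drops to target_kl or below.
    Increasing alpha flattens the weights, which drives KL toward zero."""
    n = len(path_costs)
    alpha = 0.0
    while True:
        s = path_costs / (gamma + alpha)   # rescaled costs S_alpha
        s = s - s.min()                    # shift for a stable exp()
        w = np.exp(-s)
        w /= w.sum()                       # normalized weights
        kl = np.log(n) + np.sum(w * np.log(w + 1e-300))
        if kl <= target_kl or alpha >= max_alpha:
            return w, alpha
        alpha += alpha_step

def smoothed_cost_gradient(w, grad_log_pi):
    """Whiten the weights and form the gradient estimate
    g = sum_i w_hat_i * (sum_t grad log pi(a_t | t, x_t)),
    where grad_log_pi has one row of summed per-trajectory scores per sample."""
    w_hat = (w - w.mean()) / (w.std() + 1e-12)
    return w_hat @ grad_log_pi             # (N,) @ (N, d) -> (d,)
```

As a usage sketch, `smoothed_weights(np.array([1.0, 2.0, 10.0, 0.5]), gamma=1.0, target_kl=0.1)` returns normalized weights whose entropy deficit is at most 0.1; the looser the `target_kl`, the smaller the α found and the more the low-cost trajectories dominate the gradient estimate.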