Adaptive Smoothing for Path Integral Control
Authors: Dominik Thalmeier, Hilbert J. Kappen, Simone Totaro, Vicenç Gómez
JMLR 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct numerical experiments on different control tasks, which show this accelerative effect in practice. For this we develop an algorithm called ASPIC (Adaptive Smoothing for Path Integral Control) that uses cost smoothing to speed up policy optimization. The algorithm adjusts the smoothing parameter in each step to keep the variance of the gradient estimator at a predefined level. To ensure robust updates of the policy, ASPIC enforces a trust region constraint; similar to Schulman et al. (2015) this is achieved with natural gradient updates and an adaptive stepsize. Like other policy gradient based methods (Williams, 1992; Peters and Schaal, 2008; Schulman et al., 2015; Heess et al., 2017) ASPIC is model-free. Many policy optimization algorithms update the control policy based on a direct optimization of the cost; examples are Trust Region Policy Optimization (TRPO) (Schulman et al., 2015) or Path-Integral Relative Entropy Policy Search (PIREPS) (Gómez et al., 2014), where the latter is particularly developed for path integral control problems. The main novelty of this work is the application to path integral control problems of the idea of smoothing, as introduced in Chaudhari et al. (2018). This technique outperforms direct cost optimization, achieving faster convergence rates with only a negligible amount of computational overhead. 6. Numerical Experiments We now analyze empirically the convergence speed of policy optimization with and without smoothing and show that smoothing accelerates convergence. For the optimization with smoothing, we use ASPIC (Algorithm 1) and for the optimization without smoothing, we use a version of ASPIC where we replaced the gradient of the smoothed cost with the gradient of the cost itself. We first consider a simple linear-quadratic (LQ) control problem and then focus on non-linear control tasks, for which we analyze the dependence of ASPIC on the hyper-parameters. We also compare ASPIC to other related RL algorithms. Further details about the numerical experiments are found in Appendix L. |
| Researcher Affiliation | Academia | Dominik Thalmeier EMAIL Radboud University Nijmegen Nijmegen, The Netherlands Hilbert J. Kappen EMAIL Radboud University Nijmegen Nijmegen, The Netherlands Simone Totaro EMAIL Universitat Pompeu Fabra Barcelona, Spain Vicenç Gómez EMAIL Universitat Pompeu Fabra Barcelona, Spain |
| Pseudocode | Yes | Algorithm 1 ASPIC (Adaptive Smoothing for Path Integral Control). Require: state cost function V(x, t); control cost parameter γ; base policy π₀ that defines the uncontrolled dynamics; a real system or simulator to compute dynamics using a parametrized policy πθ; trust region size E; smoothing strength; number of samples per iteration N. Initialize θ₀, n = 0. Repeat: draw state trajectories τⁱ, i = 1, …, N, using the parametrized policy πθₙ; for each sample i compute the path cost Sᵞ(τⁱ) = Σ₀<t<T [V(xⁱₜ, t) + γ log(πθₙ(aⁱₜ ∣ t, xⁱₜ) / π₀(aⁱₜ ∣ t, xⁱₜ))]. {Find the minimal α such that KL ≤ smoothing strength:} α ← 0; repeat: increase α; Sⁱ_α ← Sᵞ(τⁱ) / (γ + α); compute weights wⁱ ← exp(−Sⁱ_α); normalize weights wⁱ ← wⁱ / Σᵢ wⁱ; compute the sample-size-independent weight entropy KL ← log N + Σᵢ wⁱ log wⁱ; until KL ≤ smoothing strength. {Whiten the weights:} ŵⁱ ← (wⁱ − mean(w)) / std(w). {Compute the gradient of the smoothed cost:} g ← Σᵢ Σₜ ŵⁱ ∇θ log πθ(aⁱₜ ∣ t, xⁱₜ), evaluated at θ = θₙ. {Compute the Fisher matrix F:} use conjugate gradient to approximate the natural gradient g_F = F⁻¹g (Appendix J); do a line search to compute the step size η such that KL(θₙ ‖ θₙ₊₁) = E; update parameters θₙ₊₁ ← θₙ + η·g_F; n ← n + 1; until convergence. |
| Open Source Code | No | For reproducibility, the code will be made available upon acceptance of the final manuscript |
| Open Datasets | Yes | The latter was simulated using the Open AI gym (Brockman et al., 2016). For pendulum swing-up and the Acrobot tasks we used time-varying linear feedback controllers, whereas for the 2D Walker task we parametrized the control uθ using a neural network. ... We evaluate the performance of these algorithms on a set of six tasks from Pybullet, an open source real-time physics engine (see Appendix L.5 for more details). |
| Dataset Splits | No | The paper uses environments from OpenAI Gym and Pybullet, which are simulation environments for reinforcement learning where data is generated through interaction, rather than static datasets with predefined train/test/validation splits. Therefore, the concept of explicit dataset splits in the traditional sense does not directly apply or is not detailed in the paper for reproducibility. The paper describes the control tasks and initial conditions for the simulations. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU, GPU models, memory) used for running the experiments. It only mentions using general platforms like 'Open AI gym' and 'Pybullet open source engine' for simulations. |
| Software Dependencies | Yes | Table 1: Hyperparameters for the experiments using Pybullet. ... Python 3.8 |
| Experiment Setup | Yes | 6. Numerical Experiments ... Further details about the numerical experiments are found in Appendix L. ... L.1 Linear-Quadratic Control Task ... Batch size: N = 100, trust region E = 0.1, smoothing strength = 0.2 log 100, conjugate gradient iterations: 2 ... L.2 Pendulum Task ... batch size: N = 500, trust region E = 0.1, smoothing strength = 0.5 ... L.3 Acrobot Task ... batch size N = 500, trust region E = 0.1, and smoothing strength = 0.5 ... L.4 Walker Task ... batch size N = 100, trust region E = 0.01, smoothing strength = 0.05 log 100, and 10 conjugate gradient iterations. ... Table 1: Hyperparameters for the experiments using Pybullet: Number of rollouts (N): 50; Total number of rollouts: 50 000; Smoothing strength: {0.1, 0.5}; Trust region size (E): {0.025, 0.075}; Mini-batch size: 256; Units per layer: 32; Number of hidden layers: 1; Learning rate: 7e-4; Activation function: tanh; Action distribution: isotropic Gaussian |
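The inner loop of the quoted pseudocode (rescale path costs by 1/(γ+α), raise α until the sample-size-independent weight entropy falls to the target level, whiten the weights, and form the weighted gradient estimate) can be sketched in NumPy. This is a minimal illustrative sketch, not the authors' implementation: the function names, the fixed `alpha_step` search schedule, and the numerical-stabilization constants are assumptions, and the natural-gradient, trust-region, and line-search steps of ASPIC are omitted.

```python
import numpy as np

def smoothed_weights(path_costs, gamma, target_kl, alpha_step=0.1, max_alpha=1e6):
    """Find a minimal alpha such that the sample-size-independent weight
    entropy KL = log N + sum_i w_i log w_i drops to target_kl or below.
    Increasing alpha flattens the weights, which drives KL toward zero."""
    n = len(path_costs)
    alpha = 0.0
    while True:
        s = path_costs / (gamma + alpha)   # rescaled costs S_alpha
        s = s - s.min()                    # shift for a stable exp()
        w = np.exp(-s)
        w /= w.sum()                       # normalized weights
        kl = np.log(n) + np.sum(w * np.log(w + 1e-300))
        if kl <= target_kl or alpha >= max_alpha:
            return w, alpha
        alpha += alpha_step

def smoothed_cost_gradient(w, grad_log_pi):
    """Whiten the weights and form the gradient estimate
    g = sum_i w_hat_i * (sum_t grad log pi(a_t | t, x_t)),
    where grad_log_pi has one row of summed per-trajectory scores per sample."""
    w_hat = (w - w.mean()) / (w.std() + 1e-12)
    return w_hat @ grad_log_pi             # (N,) @ (N, d) -> (d,)
```

As a usage sketch, `smoothed_weights(np.array([1.0, 2.0, 10.0, 0.5]), gamma=1.0, target_kl=0.1)` returns normalized weights whose entropy deficit is at most 0.1; the looser the `target_kl`, the smaller the α found and the more the low-cost trajectories dominate the gradient estimate.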