Model-Free Trajectory-based Policy Optimization with Monotonic Improvement

Authors: Riad Akrour, Abbas Abdolmaleki, Hany Abdulsamad, Jan Peters, Gerhard Neumann

JMLR 2018

Reproducibility Variable — Result — LLM Response
Research Type — Experimental. MOTO is experimentally validated on a set of multi-link swing-up tasks and on a robot table tennis task. The experimental section analyzes the proposed algorithm from four angles: i) the quality of the returned policy compared to state-of-the-art trajectory optimization algorithms, ii) the effectiveness of the proposed variance reduction and sample reuse schemes, iii) the contribution of the added entropy constraint during policy updates to finding better local optima, and iv) the ability of the algorithm to scale to higher-dimensional problems. The experimental section concludes with a comparison to TRPO (Schulman et al., 2015), a state-of-the-art reinforcement learning algorithm.
Researcher Affiliation — Collaboration. Riad Akrour (1), Abbas Abdolmaleki (2), Hany Abdulsamad (1), Jan Peters (1, 3), Gerhard Neumann (1, 4). Affiliations: 1. CLAS/IAS, Technische Universität Darmstadt, Hochschulstr. 10, D-64289 Darmstadt, Germany; 2. DeepMind, London N1C 4AG, UK; 3. Max Planck Institute for Intelligent Systems, Max-Planck-Ring 4, Tübingen, Germany; 4. L-CAS, University of Lincoln, Lincoln LN6 7TS, UK.
Pseudocode — Yes.
Algorithm 1: Model-Free Trajectory-based Policy Optimization (MOTO)
Input: initial policy π0, number of trajectories per iteration M, step-size ϵ, and entropy reduction rate β0
Output: policy πN
for i = 0 to N − 1 do
    Sample M trajectories from πi
    for t = T to 1 do
        Estimate state distribution ρ^i_t (Sec. 4.3)
        Compute importance weights for all (s, a, s′) ∈ D1:i (Sec. 4.2)
        Estimate the Q-function Q^i_t (Sec. 4.1)
        Optimize: (η∗, ω∗) = arg min g^i_t(η, ω) (Sec. 3.3)
        Update π^{i+1}_t using η∗, ω∗, ρ^i_t and Q^i_t (Sec. 3.2)
    end for
end for
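The control flow of Algorithm 1 can be sketched as a runnable Python skeleton. Note that all the helper functions below are trivial stand-ins, not the paper's actual estimators or policy update (Secs. 3.2–3.3, 4.1–4.3); only the outer/inner loop structure and the backward pass over time steps follow the pseudocode.

```python
import numpy as np

# Hypothetical skeleton of Algorithm 1's control flow. The helpers are
# placeholder stubs that only make the loop executable end-to-end;
# they do NOT implement the paper's estimators or dual optimization.

def sample_trajectories(policy, m, horizon):
    # Stand-in: m rollouts of (state, action) pairs from a Gaussian policy.
    return [[(np.random.randn(2),
              policy["mean"] + policy["std"] * np.random.randn())
             for _ in range(horizon)] for _ in range(m)]

def estimate_state_distribution(trajs, t):
    # Sec. 4.3 placeholder: empirical mean/covariance of states at step t.
    states = np.array([traj[t - 1][0] for traj in trajs])
    return states.mean(axis=0), np.cov(states.T)

def fit_q_function(trajs, t):
    # Sec. 4.1 placeholder: a fixed quadratic model in state and action.
    return lambda s, a: -float(np.dot(s, s)) - float(a) ** 2

def update_policy(policy, t, rho_t, q_t, epsilon):
    # Secs. 3.2-3.3 placeholder: only mimics the entropy reduction.
    policy["std"] *= 0.99
    return policy

def moto(policy, n_iter=2, m=5, horizon=4, epsilon=0.1):
    for _ in range(n_iter):
        trajs = sample_trajectories(policy, m, horizon)
        for t in range(horizon, 0, -1):  # backward in time, as in Algorithm 1
            rho_t = estimate_state_distribution(trajs, t)
            q_t = fit_q_function(trajs, t)
            policy = update_policy(policy, t, rho_t, q_t, epsilon)
    return policy

policy = moto({"mean": 0.0, "std": 1.0})
```

The time-indexed inner loop is the key structural feature: MOTO maintains one policy per time step and sweeps them from t = T down to t = 1 each iteration.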
Open Source Code — No. The paper mentions OpenAI baselines (https://github.com/openai/baselines, 2017). However, this refers to a third-party tool used for comparison (the TRPO baseline) and not the authors' own code for the methodology described in this paper. There is no explicit statement of, or link to, the authors' code.
Open Datasets — No. The paper describes simulated tasks such as "multi-link swing-up tasks" and a "robot table tennis task" where data is generated through simulation. It does not use or provide access to any publicly available dataset with a link, DOI, repository, or formal citation.
Dataset Splits — No. The paper describes using "M rollouts" for sampling and discusses sample reuse across time steps and iterations. However, it does not specify traditional training/validation/test splits with percentages, sample counts, or citations to predefined splits for any fixed dataset. The data is dynamically generated via rollouts in a reinforcement learning setting, not drawn from a static dataset with defined splits.
Hardware Specification — No. The paper states: "Computing time for the experiments was granted from Lichtenberg cluster." This is a general reference to a computing cluster and does not provide specific hardware details such as GPU models, CPU types, or memory amounts.
Software Dependencies — No. The paper mentions "Open AI's baselines implementation (Dhariwal et al., 2017)" when comparing to TRPO, which refers to a software project. However, it does not provide version numbers for this or any other software components (e.g., Python version, library versions) needed to replicate the experiments.
Experiment Setup — Yes. Fig. 1.b compares GPS to two configurations of MOTO on the double-link swing-up task. The same initial policy and step-size ϵ are used by both algorithms. However, we found that GPS performs better with a smaller initial variance, as otherwise actions quickly hit the torque limits, making the dynamics modeling harder. Fig. 1.b shows that even though the dynamics of the system are not linear, GPS manages to improve the policy return and eventually finds a swing-up policy. The two configurations of MOTO have an entropy reduction constant β0 of .1 and .5. ... We chose TRPO as our reinforcement learning baseline for its state-of-the-art performance and because its policy update is similar to that of MOTO (both bound the KL divergence between successive policies). Three variants of TRPO are considered, while for MOTO we refrain from using importance sampling (Sec. 4.2), since similar techniques such as off-policy policy evaluation can be used for TRPO. First, MOTO is compared to a default version of TRPO using OpenAI's baselines implementation (Dhariwal et al., 2017), where TRPO optimizes a neural network for both learning the policy and the V-function. Default parameters are used except for the KL divergence constraint, where we set ϵ = .1 for TRPO to match MOTO's setting.
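Since the comparison hinges on both algorithms bounding the KL divergence between successive policies with the same ϵ = .1, a minimal sketch of what that constraint looks like for diagonal Gaussian policies may help; the closed-form KL is standard, but the policy parameters below are illustrative values, not taken from the paper:

```python
import numpy as np

# Minimal sketch: both MOTO and TRPO bound KL(pi_old || pi_new) <= epsilon.
# For diagonal Gaussians the KL has a closed form. The means/stds below
# are made-up illustrative values, NOT parameters from the paper.

def kl_diag_gaussians(mu0, std0, mu1, std1):
    """KL(N(mu0, diag(std0^2)) || N(mu1, diag(std1^2)))."""
    var0, var1 = std0 ** 2, std1 ** 2
    return float(np.sum(np.log(std1 / std0)
                        + (var0 + (mu0 - mu1) ** 2) / (2.0 * var1)
                        - 0.5))

epsilon = 0.1  # KL step-size used for both MOTO and TRPO in the comparison

mu_old, std_old = np.zeros(3), np.ones(3)          # old policy
mu_new, std_new = 0.1 * np.ones(3), 0.9 * np.ones(3)  # candidate update

kl = kl_diag_gaussians(mu_old, std_old, mu_new, std_new)
within_bound = kl <= epsilon
```

In MOTO this bound is enforced per time step through the dual variables (η∗, ω∗) of Sec. 3.3, whereas TRPO enforces it on a single stationary neural-network policy.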