Expected Policy Gradients for Reinforcement Learning
Authors: Kamil Ciosek, Shimon Whiteson
JMLR 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finally, we provide an extensive experimental evaluation of EPG and show that it outperforms existing approaches on multiple challenging control domains. |
| Researcher Affiliation | Collaboration | Kamil Ciosek EMAIL, Microsoft Research Cambridge, 21 Station Road, Cambridge CB1 2FB, United Kingdom; Shimon Whiteson EMAIL, Department of Computer Science, University of Oxford, Wolfson Building, Parks Road, Oxford OX1 3QD, United Kingdom |
| Pseudocode | Yes | Algorithm 1: Expected policy gradients; Algorithm 2: Gaussian policy gradients; Algorithm 3: Gaussian integrals; Algorithm 4: Policy gradients with clipped actions. |
| Open Source Code | No | The paper mentions using the "OpenAI version of A2C" and the "PPO2 version published by OpenAI", which are third-party tools used for comparison or as baselines, but it does not provide an explicit statement about releasing the source code for the methodology (EPG) described in this paper, nor does it provide a link to a repository for their implementation. |
| Open Datasets | Yes | To benchmark our algorithms, we use five continuous-action domains, modeled with the MuJoCo physics simulator (Todorov et al., 2012): HalfCheetah-v2, InvertedPendulum-v2, Reacher2d-v2, Walker2d-v2, and InvertedDoublePendulum-v2, as well as one discrete-action domain: Atari Pong. |
| Dataset Splits | No | The paper describes running experiments on reinforcement learning environments (MuJoCo, Atari Pong) and presents learning curves based on '20 runs' and 'thousands of steps'. However, it does not specify explicit train/test/validation dataset splits with percentages, sample counts, or references to predefined split files, which are typical for static dataset evaluations. |
| Hardware Specification | No | Experiments performed at Oxford were made possible by a generous equipment grant from NVIDIA. This mention of 'NVIDIA' is too general and does not specify particular GPU models, processor types, or other hardware components used for running the experiments. |
| Software Dependencies | No | The paper mentions using "PyTorch (Paszke et al., 2017)" and "TensorFlow (Abadi et al., 2015)", but these citations refer to the frameworks' publications, not the specific version numbers used in the experiments. It also mentions the "MuJoCo environment (version 2)", which is an environment version rather than a general software dependency, and the "PPO2 version published by OpenAI", which is a specific algorithm version, not a software library. Explicit version numbers for key software dependencies such as PyTorch are not provided. |
| Experiment Setup | Yes | The hyperparameters for DPG and those of EPG that are not related to exploration were taken from an existing benchmark (Islam et al., 2017; Brockman et al., 2016) and are detailed in Appendix A.4. The EPG exploration technique has just one hyperparameter, σ0, while OU has two (standard deviation and mean reversion constant); σ0 was optimized on the HalfCheetah domain (Figure 12) and set to σ0 = 0.5. Appendix A.4 (Experimental Details) and Table 2 provide detailed hyperparameters such as 'Target network update constant τ 0.01', 'Size of replay buffer 1000000', 'Batch size 64', 'Learning rate 1e-3', and network architecture details such as 'hidden layers of 100, 100 neurons respectively, ReLU nonlinearities'. |
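For quick reference, the hyperparameters quoted in the Experiment Setup row can be collected into a single configuration sketch. This is an illustrative summary only: the values come from the table above, but the dictionary keys are hypothetical names, not identifiers from the paper's code.

```python
# Hyperparameters reported in Appendix A.4 / Table 2 of the paper.
# Key names below are illustrative; only the values are taken from the paper.
epg_hyperparameters = {
    "target_network_update_tau": 0.01,   # soft target-network update constant τ
    "replay_buffer_size": 1_000_000,     # size of the replay buffer
    "batch_size": 64,                    # minibatch size per gradient step
    "learning_rate": 1e-3,               # optimizer learning rate
    "hidden_layers": (100, 100),         # two hidden layers of 100 neurons each
    "activation": "relu",                # ReLU nonlinearities
    "exploration_sigma0": 0.5,           # σ0, tuned on HalfCheetah (Figure 12)
}

if __name__ == "__main__":
    for name, value in epg_hyperparameters.items():
        print(f"{name}: {value}")
```

Collecting the reported values in one place makes it easy to spot which settings a replication attempt would still need to recover from Appendix A.4 (e.g. optimizer choice and any per-domain overrides are not captured here).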