Expected Policy Gradients for Reinforcement Learning
Authors: Kamil Ciosek, Shimon Whiteson
JMLR 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finally, we provide an extensive experimental evaluation of EPG and show that it outperforms existing approaches on multiple challenging control domains. |
| Researcher Affiliation | Collaboration | Kamil Ciosek EMAIL, Microsoft Research Cambridge, 21 Station Road, Cambridge CB1 2FB, United Kingdom; Shimon Whiteson EMAIL, Department of Computer Science, University of Oxford, Wolfson Building, Parks Road, Oxford OX1 3QD, United Kingdom |
| Pseudocode | Yes | Algorithm 1: Expected policy gradients; Algorithm 2: Gaussian policy gradients; Algorithm 3: Gaussian integrals; Algorithm 4: Policy gradients with clipped actions. |
| Open Source Code | No | The paper mentions using the "OpenAI version of A2C" and the "PPO2 version published by OpenAI", which are third-party tools used for comparison or as baselines, but it does not provide an explicit statement about releasing the source code for the methodology (EPG) described in this paper, nor does it provide a link to a repository for their implementation. |
| Open Datasets | Yes | To benchmark our algorithms, we use five continuous-action domains, modeled with the MuJoCo physics simulator (Todorov et al., 2012): HalfCheetah-v2, InvertedPendulum-v2, Reacher2d-v2, Walker2d-v2, and InvertedDoublePendulum-v2, as well as one discrete-action domain: Atari Pong. |
| Dataset Splits | No | The paper describes running experiments on reinforcement learning environments (MuJoCo, Atari Pong) and presents learning curves based on '20 runs' and 'thousands of steps'. However, it does not specify explicit train/test/validation dataset splits with percentages, sample counts, or references to predefined split files, which are typical for static dataset evaluations. |
| Hardware Specification | No | Experiments performed at Oxford were made possible by a generous equipment grant from NVIDIA. This mention of 'NVIDIA' is too general and does not specify particular GPU models, processor types, or other hardware components used for running the experiments. |
| Software Dependencies | No | The paper mentions using "PyTorch (Paszke et al., 2017)" and "TensorFlow (Abadi et al., 2015)", but these citations refer to the frameworks' publications, not the specific version numbers used in the experiments. It also mentions the "MuJoCo environment (version 2)", which is an environment version rather than a general software dependency, and the "PPO2 version published by OpenAI", which is a specific algorithm version, not a software library. Explicit version numbers for key software dependencies such as PyTorch are not provided. |
| Experiment Setup | Yes | The hyperparameters for DPG and those of EPG that are not related to exploration were taken from an existing benchmark (Islam et al., 2017; Brockman et al., 2016) and are detailed in Appendix A.4. The EPG exploration technique has just one hyperparameter, σ0, while OU has two (standard deviation and mean reversion constant); σ0 was optimized on the HalfCheetah domain (Figure 12) and set to σ0 = 0.5. Appendix A.4 (Experimental Details) and Table 2 provide detailed hyperparameters such as 'Target network update constant τ 0.01', 'Size of replay buffer 1000000', 'Batch size 64', 'Learning rate 1e-3', and network architecture details such as 'hidden layers of 100, 100 neurons respectively, ReLU nonlinearities'. |
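For quick reference, the hyperparameters quoted in the Experiment Setup row can be collected into a single configuration sketch. This is an illustrative summary only: the values come from the table above, but the dictionary keys are hypothetical names, not identifiers from the paper's code.

```python
# Hyperparameters reported in Appendix A.4 / Table 2 of the paper.
# Key names below are illustrative; only the values are taken from the paper.
epg_hyperparameters = {
    "target_network_update_tau": 0.01,   # soft target-network update constant τ
    "replay_buffer_size": 1_000_000,     # size of the replay buffer
    "batch_size": 64,                    # minibatch size per gradient step
    "learning_rate": 1e-3,               # optimizer learning rate
    "hidden_layers": (100, 100),         # two hidden layers of 100 neurons each
    "activation": "relu",                # ReLU nonlinearities
    "exploration_sigma0": 0.5,           # σ0, tuned on HalfCheetah (Figure 12)
}

if __name__ == "__main__":
    for name, value in epg_hyperparameters.items():
        print(f"{name}: {value}")
```

Collecting the reported values in one place makes it easy to spot which settings a replication attempt would still need to recover from Appendix A.4 (e.g. optimizer choice and any per-domain overrides are not captured here).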