Data-Efficient Policy Evaluation Through Behavior Policy Search
Authors: Josiah P. Hanna, Yash Chandak, Philip S. Thomas, Martha White, Peter Stone, Scott Niekum
JMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "We present two behavior policy search algorithms and empirically demonstrate their effectiveness in lowering the mean squared error of policy performance estimates." ... Section 9, Empirical Study: "This section presents an empirical study of variance reduction through behavior policy search. We design our experiments to answer the following questions: Can behavior policy search with BPG-V and BPG-KL reduce the MSE of batch policy evaluation compared to on-policy estimates in both tabular and continuous domains?" |
| Researcher Affiliation | Collaboration | Josiah P. Hanna (EMAIL), Computer Sciences Department, University of Wisconsin–Madison, Madison, WI, USA; Yash Chandak and Philip S. Thomas (EMAIL), College of Information and Computer Sciences, University of Massachusetts, Amherst, MA, USA; Martha White (EMAIL), Department of Computing Science, University of Alberta and the Alberta Machine Intelligence Institute (Amii), Edmonton, Alberta, Canada; Peter Stone (EMAIL), Department of Computer Science, The University of Texas at Austin and Sony AI, Austin, TX, USA; Scott Niekum (EMAIL), College of Information and Computer Sciences, University of Massachusetts, Amherst, MA, USA. Peter Stone serves as the Executive Director of Sony AI America and receives financial compensation for this work. |
| Pseudocode | Yes | Algorithm 1: Behavior Policy Gradient on the Variance; Algorithm 2: Behavior Policy Gradient on the KL-Divergence |
| Open Source Code | No | The paper does not contain any explicit statements or links to source code released by the authors for the methodology described in this paper. It mentions using existing tools such as RLLAB, OpenAI Gym, and PyBullet, but not code for the authors' own method. |
| Open Datasets | Yes | "The first two of these are the continuous control Cart Pole Swing Up and Acrobot tasks implemented within RLLAB (Duan et al., 2016), the third task is the Cart Pole task from OpenAI Gym (Brockman et al., 2016), and the final task is the PyBullet (Coumans and Bai, 2016–2019) variant of the Hopper domain from OpenAI Gym (Brockman et al., 2016)." |
| Dataset Splits | No | The paper describes an incremental batch policy evaluation setting where trajectories are collected in batches during the experiment (e.g., 'batch sizes of 100 trajectories per iteration for Grid World experiments and size 500 for the continuous control tasks'). It does not define traditional training/test/validation splits of a static dataset. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, processor types, or memory specifications used for running the experiments. It only describes the algorithms and their application to various tasks. |
| Software Dependencies | No | The paper mentions several software components and algorithms: 'RLLAB (Duan et al., 2016)', 'OpenAI Gym (Brockman et al., 2016)', 'PyBullet (Coumans and Bai, 2016–2019)', the 'TRPO algorithm (Schulman et al., 2015)', and 'proximal policy optimization (Schulman et al., 2017)'. However, it does not specify version numbers for any of these components. |
| Experiment Setup | Yes | For Cart Pole Swing Up and Acrobot, πe is a two layer neural network with 32 tanh units per layer that maps the state to the mean of a Gaussian distribution over the continuous action space. For Cart Pole Swing Up, πe was learned using 10 iterations of the TRPO algorithm (Schulman et al., 2015) applied to a randomly initialized policy. For Acrobot, πe was learned using 60 iterations. For Cart Pole and Hopper, πe is a neural network with two layers of 64 tanh hidden units in each layer and is trained using 200 iterations of proximal policy optimization (Schulman et al., 2017). For Cart Pole the network maps the state to a softmax distribution over actions while in Hopper the network maps the state to a Gaussian distribution over the continuous-valued actions. For Cart Pole Swing Up and Acrobot we use l = 50 and γ = 1; Cart Pole and Hopper use l = 200 (with early termination possible) and γ = 1. For step-size selection at each iteration BPG-V and BPG-KL use the largest possible step-size subject to a constraint on the KL-divergence between the old and new policy. ... Each method uses a step-size of 5 * 10^-5. ... We use batch sizes of 100 trajectories per iteration for Grid World experiments and size 500 for the continuous control tasks. |
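The setup above concerns off-policy policy evaluation: estimating the expected return of an evaluation policy πe from trajectories collected under a behavior policy, which is the quantity BPG-V and BPG-KL tune the behavior policy to estimate with lower MSE. As a point of reference, the standard per-trajectory importance-sampling estimator underlying this setting can be sketched as follows (a minimal illustration with hypothetical function names; the paper's actual implementation is not released):

```python
import numpy as np

def importance_sampling_estimate(trajectories, pi_e, pi_b, gamma=1.0):
    """Per-trajectory importance-sampling estimate of pi_e's expected return,
    using trajectories collected under a behavior policy pi_b.

    trajectories: list of trajectories, each a list of (state, action, reward).
    pi_e, pi_b: functions (state, action) -> action probability under each policy.
    """
    estimates = []
    for traj in trajectories:
        weight, ret = 1.0, 0.0
        for t, (s, a, r) in enumerate(traj):
            # Cumulative likelihood ratio corrects for sampling actions from pi_b.
            weight *= pi_e(s, a) / pi_b(s, a)
            ret += gamma**t * r
        estimates.append(weight * ret)
    # Unbiased estimate of pi_e's expected (discounted) return.
    return float(np.mean(estimates))
```

Behavior policy search chooses πb (here `pi_b`) to reduce the variance, and hence the MSE, of this estimator relative to the on-policy choice πb = πe.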