Data-Efficient Policy Evaluation Through Behavior Policy Search
Authors: Josiah P. Hanna, Yash Chandak, Philip S. Thomas, Martha White, Peter Stone, Scott Niekum
JMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "We present two behavior policy search algorithms and empirically demonstrate their effectiveness in lowering the mean squared error of policy performance estimates." ... Section 9, Empirical Study: "This section presents an empirical study of variance reduction through behavior policy search. We design our experiments to answer the following questions: Can behavior policy search with BPG-V and BPG-KL reduce the MSE of batch policy evaluation compared to on-policy estimates in both tabular and continuous domains?" |
| Researcher Affiliation | Collaboration | Josiah P. Hanna (EMAIL), Computer Sciences Department, University of Wisconsin–Madison, Madison, WI, USA; Yash Chandak and Philip S. Thomas (EMAIL), College of Information and Computer Sciences, University of Massachusetts, Amherst, MA, USA; Martha White (EMAIL), Department of Computing Science, University of Alberta and the Alberta Machine Intelligence Institute (Amii), Edmonton, Alberta, Canada; Peter Stone (EMAIL), Department of Computer Science, The University of Texas at Austin and Sony AI, Austin, TX, USA; Scott Niekum (EMAIL), College of Information and Computer Sciences, University of Massachusetts, Amherst, MA, USA. Peter Stone serves as the Executive Director of Sony AI America and receives financial compensation for this work. |
| Pseudocode | Yes | Algorithm 1: Behavior Policy Gradient on the Variance; Algorithm 2: Behavior Policy Gradient on the KL-Divergence |
| Open Source Code | No | The paper does not contain any explicit statements or links to source code released by the authors for the methodology described in this paper. It mentions using existing tools such as RLLAB, OpenAI Gym, and PyBullet, but not code for the authors' own method. |
| Open Datasets | Yes | "The first two of these are the continuous control Cart Pole Swing Up and Acrobot tasks implemented within RLLAB (Duan et al., 2016), the third task is the Cart Pole task from OpenAI Gym (Brockman et al., 2016), and the final task is the PyBullet (Coumans and Bai, 2016–2019) variant of the Hopper domain from OpenAI Gym (Brockman et al., 2016)." |
| Dataset Splits | No | The paper describes an incremental batch policy evaluation setting where trajectories are collected in batches during the experiment (e.g., 'batch sizes of 100 trajectories per iteration for Grid World experiments and size 500 for the continuous control tasks'). It does not define traditional training/test/validation splits of a static dataset. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, processor types, or memory specifications used for running the experiments. It only describes the algorithms and their application to various tasks. |
| Software Dependencies | No | The paper mentions several software components and algorithms: 'RLLAB (Duan et al., 2016)', 'OpenAI Gym (Brockman et al., 2016)', 'PyBullet (Coumans and Bai, 2016–2019)', the 'TRPO algorithm (Schulman et al., 2015)', and 'proximal policy optimization (Schulman et al., 2017)'. However, it does not specify version numbers for any of these components. |
| Experiment Setup | Yes | For Cart Pole Swing Up and Acrobot, πe is a two layer neural network with 32 tanh units per layer that maps the state to the mean of a Gaussian distribution over the continuous action space. For Cart Pole Swing Up, πe was learned using 10 iterations of the TRPO algorithm (Schulman et al., 2015) applied to a randomly initialized policy. For Acrobot, πe was learned using 60 iterations. For Cart Pole and Hopper, πe is a neural network with two layers of 64 tanh hidden units in each layer and is trained using 200 iterations of proximal policy optimization (Schulman et al., 2017). For Cart Pole the network maps the state to a softmax distribution over actions while in Hopper the network maps the state to a Gaussian distribution over the continuous-valued actions. For Cart Pole Swing Up and Acrobot we use l = 50 and γ = 1; Cart Pole and Hopper use l = 200 (with early termination possible) and γ = 1. For step-size selection at each iteration BPG-V and BPG-KL use the largest possible step-size subject to a constraint on the KL-divergence between the old and new policy. ... Each method uses a step-size of 5 * 10^-5. ... We use batch sizes of 100 trajectories per iteration for Grid World experiments and size 500 for the continuous control tasks. |
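The setup above concerns off-policy policy evaluation: estimating the expected return of an evaluation policy πe from trajectories collected under a behavior policy, which is the quantity BPG-V and BPG-KL tune the behavior policy to estimate with lower MSE. As a point of reference, the standard per-trajectory importance-sampling estimator underlying this setting can be sketched as follows (a minimal illustration with hypothetical function names; the paper's actual implementation is not released):

```python
import numpy as np

def importance_sampling_estimate(trajectories, pi_e, pi_b, gamma=1.0):
    """Per-trajectory importance-sampling estimate of pi_e's expected return,
    using trajectories collected under a behavior policy pi_b.

    trajectories: list of trajectories, each a list of (state, action, reward).
    pi_e, pi_b: functions (state, action) -> action probability under each policy.
    """
    estimates = []
    for traj in trajectories:
        weight, ret = 1.0, 0.0
        for t, (s, a, r) in enumerate(traj):
            # Cumulative likelihood ratio corrects for sampling actions from pi_b.
            weight *= pi_e(s, a) / pi_b(s, a)
            ret += gamma**t * r
        estimates.append(weight * ret)
    # Unbiased estimate of pi_e's expected (discounted) return.
    return float(np.mean(estimates))
```

Behavior policy search chooses πb (here `pi_b`) to reduce the variance, and hence the MSE, of this estimator relative to the on-policy choice πb = πe.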