Empirical Design in Reinforcement Learning
Authors: Andrew Patterson, Samuel Neumann, Martha White, Adam White
JMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This manuscript represents both a call to action and a comprehensive resource for how to do good experiments in reinforcement learning. In particular, we cover: the statistical assumptions underlying common performance measures, how to properly characterize performance variation and stability, hypothesis testing, special considerations for comparing multiple agents, baseline and illustrative example construction, and how to deal with hyperparameters and experimenter bias. Throughout we highlight common mistakes found in the literature and the statistical consequences of those mistakes in example experiments. The objective of this document is to provide answers on how we can use our unprecedented compute to do good science in reinforcement learning, as well as stay alert to potential pitfalls in our empirical design. |
| Researcher Affiliation | Academia | Andrew Patterson EMAIL; Samuel Neumann EMAIL; Martha White EMAIL; Adam White EMAIL. Department of Computing Science and Alberta Machine Intelligence Institute (Amii), University of Alberta, Edmonton, Canada; Canada CIFAR AI Chair |
| Pseudocode | No | The paper mentions an 'efficient algorithm for estimating idealized performance (in Algorithm F.3)' in Appendix A, but the actual pseudocode block for Algorithm F.3 is not provided within the text content of the paper. |
| Open Source Code | No | The paper refers to and links third-party codebases (SAC codebase, RLLab codebase) that were used for experiments in Section 6. However, it does not explicitly state that the authors are releasing their own code for the methodologies or novel findings described in this paper. |
| Open Datasets | Yes | Classic environments that are still commonly used include Mountain Car (Moore, 1990), Cartpole (Sutton and Barto, 2018), Puddle World (Sutton and Barto, 2018) and Acrobot (Sutton, 1996). ... Two examples of such benchmarks are the Atari suite (Bellemare et al., 2013; Machado et al., 2018) and Mujoco environments (Todorov et al., 2012). ... Examples include Atari (Mnih et al., 2013), Metaworld (Yu et al., 2020) and Mini Grid (Chevalier-Boisvert et al., 2018). |
| Dataset Splits | No | Unlike supervised learning, reinforcement learning experiments are online and interactive. The agent (a program) generates its own training data by interacting with the environment (another program), and the quality of the data depends on what the agent learned previously. This interaction makes fair comparisons and scientific reproducibility major challenges in reinforcement learning. Many of the ideas from classical machine learning, such as train and test splits, overfitting, cross-validation, and model selection, are either different or non-existent in reinforcement learning. |
| Hardware Specification | No | The paper discusses computational resources generally, mentioning 'significant computational resources' and 'unprecedented compute' but does not provide specific details such as GPU models, CPU types, or memory specifications used for the experiments. |
| Software Dependencies | No | The paper mentions 'scipy' in Figure 4 and 'Matlab, R, and SPSS' in footnote 5, but no specific version numbers are provided for these or any other software libraries or tools used in the experiments. The codebases for SAC and RLLab are mentioned as being used, but their versions are not specified. |
| Experiment Setup | Yes | For this experiment, we use the Expected SARSA algorithm (ESARSA) with an ϵ-greedy policy serving both as the bootstrapping target and as the behavior policy. The agent uses tile-coded features (Sutton and Barto, 2018) mapping the (x, y)-coordinates within the gridworld to a large binary feature vector. The state-action value function estimate is a linear function of the tile-coded features. Hyperparameters: tiles = 4, tilings = 8, stepsize = 0.1, ϵ = 0.2, experiment length = 30k steps, γ = 0.99 |
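
The ESARSA setup quoted in the Experiment Setup row can be sketched in code. This is a minimal illustration under stated assumptions, not the authors' implementation: the tile coder is assumed to emit a list of active binary feature indices, and the names `ESARSA` and `epsilon_greedy_probs` are hypothetical.

```python
import numpy as np

def epsilon_greedy_probs(q_values, epsilon):
    """Action probabilities for an epsilon-greedy policy over q_values."""
    n = len(q_values)
    probs = np.full(n, epsilon / n)
    probs[int(np.argmax(q_values))] += 1.0 - epsilon
    return probs

class ESARSA:
    """Linear Expected SARSA over binary features.

    Assumes a tile coder (not shown) maps each state to a short list of
    active feature indices, so Q(s, a) is a sum of per-action weights at
    those indices. Hyperparameters default to the values quoted above.
    """
    def __init__(self, n_features, n_actions, alpha=0.1, gamma=0.99, epsilon=0.2):
        self.w = np.zeros((n_actions, n_features))
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon

    def q(self, active):
        # Q(s, .) for binary features = sum of weights at active indices.
        return self.w[:, active].sum(axis=1)

    def select_action(self, active, rng):
        # The behavior policy is the same epsilon-greedy policy used
        # in the bootstrapping target.
        probs = epsilon_greedy_probs(self.q(active), self.epsilon)
        return int(rng.choice(len(probs), p=probs))

    def update(self, active, action, reward, next_active, done):
        q_sa = self.q(active)[action]
        if done:
            target = reward
        else:
            next_q = self.q(next_active)
            # Expected SARSA bootstraps on the expectation of Q under the
            # epsilon-greedy policy, not on a sampled next action.
            probs = epsilon_greedy_probs(next_q, self.epsilon)
            target = reward + self.gamma * probs @ next_q
        # Gradient of the linear Q w.r.t. active binary features is 1,
        # so the TD error is added directly at the active indices.
        self.w[action, active] += self.alpha * (target - q_sa)
```

With tile coding it is common to divide the stepsize by the number of tilings (here 8) so that the effective update magnitude is independent of how many tiles are active; whether the paper's stepsize of 0.1 is already normalized this way is not stated in the quoted text.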