Empirical Design in Reinforcement Learning
Authors: Andrew Patterson, Samuel Neumann, Martha White, Adam White
JMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This manuscript represents both a call to action and a comprehensive resource for how to do good experiments in reinforcement learning. In particular, we cover: the statistical assumptions underlying common performance measures, how to properly characterize performance variation and stability, hypothesis testing, special considerations for comparing multiple agents, baseline and illustrative example construction, and how to deal with hyperparameters and experimenter bias. Throughout we highlight common mistakes found in the literature and the statistical consequences of those mistakes in example experiments. The objective of this document is to provide answers on how we can use our unprecedented compute to do good science in reinforcement learning, as well as stay alert to potential pitfalls in our empirical design. |
| Researcher Affiliation | Academia | Andrew Patterson EMAIL; Samuel Neumann EMAIL; Martha White EMAIL; Adam White EMAIL. Department of Computing Science and Alberta Machine Intelligence Institute (Amii), University of Alberta, Edmonton, Canada; Canada CIFAR AI Chair |
| Pseudocode | No | The paper mentions an 'efficient algorithm for estimating idealized performance (in Algorithm F.3)' in Appendix A, but the actual pseudocode block for Algorithm F.3 is not provided within the text content of the paper. |
| Open Source Code | No | The paper refers to and links third-party codebases (SAC codebase, RLLab codebase) that were used for experiments in Section 6. However, it does not explicitly state that the authors are releasing their own code for the methodologies or novel findings described in this paper. |
| Open Datasets | Yes | Classic environments that are still commonly used include Mountain Car (Moore, 1990), Cartpole (Sutton and Barto, 2018), Puddle World (Sutton and Barto, 2018) and Acrobot (Sutton, 1996). ... Two examples of such benchmarks are the Atari suite (Bellemare et al., 2013; Machado et al., 2018) and Mujoco environments (Todorov et al., 2012). ... Examples include Atari (Mnih et al., 2013), Metaworld (Yu et al., 2020) and Mini Grid (Chevalier-Boisvert et al., 2018). |
| Dataset Splits | No | Unlike supervised learning, reinforcement learning experiments are online and interactive. The agent (a program) generates its own training data by interacting with the environment (another program), and the quality of the data depends on what the agent learned previously. This interaction makes fair comparisons and scientific reproducibility major challenges in reinforcement learning. Many of the ideas from classical machine learning, such as train and test splits, overfitting, cross-validation, and model selection, are either different or non-existent in reinforcement learning. |
| Hardware Specification | No | The paper discusses computational resources generally, mentioning 'significant computational resources' and 'unprecedented compute' but does not provide specific details such as GPU models, CPU types, or memory specifications used for the experiments. |
| Software Dependencies | No | The paper mentions 'scipy' in Figure 4 and 'Matlab, R, and SPSS' in footnote 5, but no specific version numbers are provided for these or any other software libraries or tools used in the experiments. The codebases for SAC and RLLab are mentioned as being used, but their versions are not specified. |
| Experiment Setup | Yes | For this experiment, we use the Expected SARSA algorithm (ESARSA) with an ϵ-greedy policy serving both as the bootstrapping target and as the behavior policy. The agent uses tile-coded features (Sutton and Barto, 2018) mapping the (x, y)-coordinates within the gridworld to a large binary feature vector. The state-action value function estimate is a linear function of the tile-coded features. Hyperparameters: tiles = 4, tilings = 8, stepsize = 0.1, ϵ = 0.2, experiment length = 30k steps, γ = 0.99 |
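
The ESARSA setup quoted in the Experiment Setup row can be sketched in code. This is a minimal illustration under stated assumptions, not the authors' implementation: the tile coder is assumed to emit a list of active binary feature indices, and the names `ESARSA` and `epsilon_greedy_probs` are hypothetical.

```python
import numpy as np

def epsilon_greedy_probs(q_values, epsilon):
    """Action probabilities for an epsilon-greedy policy over q_values."""
    n = len(q_values)
    probs = np.full(n, epsilon / n)
    probs[int(np.argmax(q_values))] += 1.0 - epsilon
    return probs

class ESARSA:
    """Linear Expected SARSA over binary features.

    Assumes a tile coder (not shown) maps each state to a short list of
    active feature indices, so Q(s, a) is a sum of per-action weights at
    those indices. Hyperparameters default to the values quoted above.
    """
    def __init__(self, n_features, n_actions, alpha=0.1, gamma=0.99, epsilon=0.2):
        self.w = np.zeros((n_actions, n_features))
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon

    def q(self, active):
        # Q(s, .) for binary features = sum of weights at active indices.
        return self.w[:, active].sum(axis=1)

    def select_action(self, active, rng):
        # The behavior policy is the same epsilon-greedy policy used
        # in the bootstrapping target.
        probs = epsilon_greedy_probs(self.q(active), self.epsilon)
        return int(rng.choice(len(probs), p=probs))

    def update(self, active, action, reward, next_active, done):
        q_sa = self.q(active)[action]
        if done:
            target = reward
        else:
            next_q = self.q(next_active)
            # Expected SARSA bootstraps on the expectation of Q under the
            # epsilon-greedy policy, not on a sampled next action.
            probs = epsilon_greedy_probs(next_q, self.epsilon)
            target = reward + self.gamma * probs @ next_q
        # Gradient of the linear Q w.r.t. active binary features is 1,
        # so the TD error is added directly at the active indices.
        self.w[action, active] += self.alpha * (target - q_sa)
```

With tile coding it is common to divide the stepsize by the number of tilings (here 8) so that the effective update magnitude is independent of how many tiles are active; whether the paper's stepsize of 0.1 is already normalized this way is not stated in the quoted text.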