AdaStop: adaptive statistical testing for sound comparisons of Deep RL agents
Authors: Timothée Mathieu, Matheus Medeiros Centa, Riccardo Della Vecchia, Hector Kohler, Alena Shilova, Odalric-Ambrym Maillard, Philippe Preux
TMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We illustrate the effectiveness of AdaStop in various use-cases, including toy examples and Deep RL algorithms on challenging MuJoCo environments. AdaStop is the first statistical test fitted to this sort of comparison: it is both a significant contribution to statistics, and an important contribution to computational studies performed in reinforcement learning and in other domains. |
| Researcher Affiliation | Academia | Timothée Mathieu EMAIL Inria, Université de Lille, CNRS, Centrale Lille, UMR 9189 CRIStAL; Riccardo Della Vecchia EMAIL Inria, Université de Lille, CNRS, Centrale Lille, UMR 9189 CRIStAL; Alena Shilova EMAIL Inria, Université de Lille, CNRS, Centrale Lille, UMR 9189 CRIStAL; Matheus Medeiros Centa EMAIL Université de Lille, Inria, CNRS, Centrale Lille, UMR 9189 CRIStAL; Hector Kohler EMAIL Université de Lille, Inria, CNRS, Centrale Lille, UMR 9189 CRIStAL; Odalric-Ambrym Maillard EMAIL Inria, Université de Lille, CNRS, Centrale Lille, UMR 9189 CRIStAL; Philippe Preux EMAIL Université de Lille, Inria, CNRS, Centrale Lille, UMR 9189 CRIStAL |
| Pseudocode | Yes | Algorithm 1: Adaptive stopping to compare two RL agents. This algorithm is expressed in the context of the comparison of RL agents. It is easy to adapt to other types of computational agents. Algorithm 2: Multiple testing by step-down permutation test. Algorithm 3: AdaStop (main algorithm) in the context of the comparison of 2 RL agents. Its application to other types of computational agents is straightforward. Algorithm 4: Early accept. |
| Open Source Code | Yes | To reproduce the experiments of this paper, the Python code is freely available on GitHub at https://github.com/TimotheeMathieu/Adaptive_stopping_MC_RL. In addition, we provide a library and command-line tool that can be used independently: the AdaStop Python package is available at https://github.com/TimotheeMathieu/adastop. |
| Open Datasets | Yes | We use the MuJoCo (Todorov et al., 2012) benchmark for high-dimensional continuous control. We use the Gymnasium implementation. Similarly to (Colas et al., 2019, Table 15), we compute the empirical statistical power of AdaStop as a function of the number of scores of the RL agents (Table 2). To compute the empirical statistical power for a given number of scores, we make the hypothesis that the distributions of SAC and TD3 agent scores are different, and we count how many times AdaStop decides that one agent is MLB than the other (number of true positives). As the test is adaptive, we also report the effective number of scores that are necessary to make a decision with 0.95 confidence level. For each number of scores, we have run AdaStop 10^3 times. For example, when comparing the scores of SAC and TD3 on HalfCheetah using AdaStop with N = 4 and K = 5, the maximum number of scores that is used is NK = 20. However, we observe in Table 2 that when N = 4 and K = 5, AdaStop can make a decision with a power of 0.82 using only 12 scores. In (Colas et al., 2019, Table 15), the minimum number of scores required to obtain a statistical power of 0.8 when comparing SAC and TD3 agents is 15 when using either a t-test, a Welch test, or a bootstrapping test. With this example, we first show that, being an adaptive test, AdaStop may save computations. We also show that as long as the scores of agents are made available, AdaStop can use them to provide a statistically sound conclusion, and as such, AdaStop may be used to assess the initial conclusions, hopefully strengthening them with a statistically significant argument. (Data available at https://github.com/flowersteam/rl_stats/tree/master/data.) |
| Dataset Splits | No | The paper focuses on the number of independent executions (runs/seeds) for evaluating agents, not on conventional train/test/validation splits of a fixed dataset. |
| Hardware Specification | No | The paper does not mention any specific hardware (e.g., GPU/CPU models, memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions software libraries like rlberry (Domingues et al., 2021), Stable-Baselines3 (Raffin et al., 2021), CleanRL (Huang et al., 2022), MushroomRL (D'Eramo et al., 2021), and the Gymnasium implementation, but does not provide specific version numbers for these software dependencies in the main text or appendix. |
| Experiment Setup | Yes | For each algorithm, we fix the hyperparameters to those used by the library authors in their benchmarks for one of the MuJoCo environments. Appendix H.2 lists the values that were used and we further discuss the experimental setup. H.2 MuJoCo Experiments. Hyperparameters. Table 4 lists the hyperparameters used for each Deep RL agent on the MuJoCo benchmark. Each agent is trained during 10^6 interactions with its environment in the cases of HalfCheetah-v3, Hopper-v3, and Walker2d-v3. It is trained during 2×10^6 interactions in the cases of Ant-v3 and Humanoid-v3. In all cases, a training episode is made of no more than 10^3 interactions. Scores. According to Algorithm 3, after each agent is trained, the resulting policy performs 50 evaluation episodes. The score of the agent is the mean performance on these 50 episodes. |
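The step-down procedure in Algorithm 2 is built on permutation tests over agent scores. As a rough illustration only (this is not the authors' exact procedure, and the function name and test statistic are our own choices), a basic two-sample permutation test on the difference of mean scores can be sketched as:

```python
import numpy as np

def permutation_test(scores_a, scores_b, n_permutations=10_000, seed=0):
    """Two-sample permutation test on the absolute difference of mean scores.

    Returns an estimated p-value for the null hypothesis that the two
    score distributions are identical (exchangeable).
    """
    rng = np.random.default_rng(seed)
    scores_a = np.asarray(scores_a, dtype=float)
    scores_b = np.asarray(scores_b, dtype=float)
    observed = abs(scores_a.mean() - scores_b.mean())
    pooled = np.concatenate([scores_a, scores_b])
    n_a = len(scores_a)
    count = 0
    for _ in range(n_permutations):
        # Under the null, any relabeling of the pooled scores is equally likely
        rng.shuffle(pooled)
        diff = abs(pooled[:n_a].mean() - pooled[n_a:].mean())
        if diff >= observed:
            count += 1
    # Add-one correction keeps the estimated p-value strictly positive
    return (count + 1) / (n_permutations + 1)
```

A permutation test of this kind makes no normality assumption on the score distributions, which is one motivation the paper gives for preferring it over t-tests when the number of training runs is small.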
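The empirical-power experiment quoted above (run the test many times on fresh scores and count true positives) follows a generic recipe. Below is a minimal sketch under our own assumptions: a Welch t-test stands in for AdaStop, the score-sampling callables are hypothetical, and SciPy is assumed available.

```python
import numpy as np
from scipy import stats

def empirical_power(sample_a, sample_b, n_scores, n_trials=1000,
                    alpha=0.05, seed=0):
    """Estimate statistical power: the fraction of trials in which the
    test rejects equality of mean scores at level `alpha`.

    sample_a, sample_b: callables (rng, n) -> array of n sampled scores,
    standing in for fresh training runs of each agent.
    """
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(n_trials):
        a = sample_a(rng, n_scores)
        b = sample_b(rng, n_scores)
        # Welch t-test: does not assume equal variances across agents
        _, p_value = stats.ttest_ind(a, b, equal_var=False)
        if p_value < alpha:
            rejections += 1
    return rejections / n_trials
```

For instance, with two unit-variance Gaussians whose means differ by 2, 15 scores per agent already give power close to 1, while identical distributions are rejected at roughly the nominal rate alpha. Unlike this fixed-budget sketch, AdaStop additionally adapts the number of scores it collects before deciding.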