Proximal Curriculum for Reinforcement Learning Agents
Authors: Georgios Tzannetos, Bárbara Gomes Ribeiro, Parameswaran Kamalaruban, Adish Singla
TMLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results on a variety of domains demonstrate the effectiveness of our curriculum strategy over state-of-the-art baselines in accelerating the training process of deep RL agents. In this section, we evaluate the effectiveness of our curriculum strategies on a variety of domains w.r.t. the uniform performance of the trained RL agent over the training pool of tasks. |
| Researcher Affiliation | Academia | Georgios Tzannetos, Max Planck Institute for Software Systems; Bárbara Gomes Ribeiro, Max Planck Institute for Software Systems; Parameswaran Kamalaruban, The Alan Turing Institute; Adish Singla, Max Planck Institute for Software Systems |
| Pseudocode | Yes | Algorithm 1 RL Agent Training as Interaction between Teacher-Student Components. Algorithm 2 in the appendix provides a complete pseudo-code for the RL agent training with ProCuRL-val in this general setting. |
| Open Source Code | Yes | 1Github repo: https://github.com/machine-teaching-group/tmlr2023_proximal-curriculum-rl. |
| Open Datasets | Yes | Based on the work of Klink et al. (2020b), we consider a contextual Point Mass environment... This environment is the same used in the work of Klink et al. (2020b)... This environment is adapted from the original MuJoCo Ant environment (Todorov et al., 2012). |
| Dataset Splits | Yes | For Basic Karel, we have a train and test dataset of 24000 and 2400 tasks, respectively. For Point Mass-S, we constructed a separate test set of 100 tasks by uniformly picking tasks from the task space. We construct the training pool of tasks by uniformly sampling 100 tasks over the space of possible tasks. |
| Hardware Specification | Yes | All the experiments were conducted on a cluster of machines with CPUs of model Intel Xeon Gold 6134M CPU @ 3.20GHz. |
| Software Dependencies | No | Throughout all the experiments, we use the PPO method from Stable-Baselines3 library for policy optimization (Schulman et al., 2017; Raffin et al., 2021). The version of Stable-Baselines3 is not provided. |
| Experiment Setup | Yes | In Figure 6, we report the PPO hyperparameters used in the experiments. For each environment, all the hyperparameters are consistent across all the different curriculum strategies. Figure 6: Different hyperparameters of the PPO method used in the experiments for each environment. |
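The Software Dependencies row flags that the paper names Stable-Baselines3 (PPO) but not its version. One practical way a reproduction can close that gap is to record the installed versions of the key packages alongside the results. Below is a minimal sketch using only Python's standard-library `importlib.metadata`; the helper name and the package list are illustrative, not from the paper's release.

```python
from importlib import metadata

# Hypothetical helper (not part of the paper's code release): map each
# package name to its installed version, or None if it is not installed.
def dependency_versions(packages):
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions

if __name__ == "__main__":
    # The paper uses PPO from Stable-Baselines3 without stating a version;
    # logging the versions in use makes the setup reproducible.
    print(dependency_versions(["stable-baselines3", "torch", "gym"]))
```

Emitting this mapping into the experiment logs (or a pinned `requirements.txt`) would resolve the "No" recorded for this variable.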