Proximal Curriculum for Reinforcement Learning Agents

Authors: Georgios Tzannetos, Bárbara Gomes Ribeiro, Parameswaran Kamalaruban, Adish Singla

TMLR 2023

Each entry below lists a reproducibility variable, the assessed result, and the supporting excerpt (LLM response).
Research Type: Experimental
    "Experimental results on a variety of domains demonstrate the effectiveness of our curriculum strategy over state-of-the-art baselines in accelerating the training process of deep RL agents." "In this section, we evaluate the effectiveness of our curriculum strategies on a variety of domains w.r.t. the uniform performance of the trained RL agent over the training pool of tasks."
Researcher Affiliation: Academia
    Georgios Tzannetos (EMAIL), Max Planck Institute for Software Systems; Bárbara Gomes Ribeiro (EMAIL), Max Planck Institute for Software Systems; Parameswaran Kamalaruban (EMAIL), The Alan Turing Institute; Adish Singla (EMAIL), Max Planck Institute for Software Systems
Pseudocode: Yes
    "Algorithm 1: RL Agent Training as Interaction between Teacher-Student Components." "Algorithm 2 in the appendix provides a complete pseudo-code for the RL agent training with ProCuRL-val in this general setting."
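The cited Algorithm 1 frames training as a loop in which a teacher component proposes the next task and the student (the RL agent) trains on it. A rough, hedged sketch of such a loop is below; the task pool, the proximal score V * (1 - V) (which peaks for tasks of intermediate difficulty), the softmax temperature, and all names here are illustrative assumptions, not the paper's exact procedure.

```python
import math
import random

def teacher_select(task_pool, values, beta=10.0, rng=random):
    """Pick the index of the next training task via a softmax over a
    proximal-style score.

    `values` holds the student's current success estimate V(s) per task;
    the score beta * V * (1 - V) favors tasks the student solves about
    half the time. This scoring rule is a sketch, not the paper's.
    """
    scores = [beta * v * (1.0 - v) for v in values]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    return rng.choices(range(len(task_pool)), weights=weights, k=1)[0]

# Hypothetical teacher-student step: the teacher proposes a task for the
# student to train on, biased toward intermediate-difficulty tasks.
task_pool = ["easy", "medium", "hard"]
values = [0.9, 0.5, 0.1]  # made-up success estimates per task
idx = teacher_select(task_pool, values, rng=random.Random(0))
```

In a full training loop, the student would update its policy on the selected task and the teacher would refresh `values` from fresh rollouts before the next selection.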
Open Source Code: Yes
    GitHub repo: https://github.com/machine-teaching-group/tmlr2023_proximal-curriculum-rl
Open Datasets: Yes
    "Based on the work of Klink et al. (2020b), we consider a contextual Point Mass environment..." "This environment is the same used in the work of Klink et al. (2020b)..." "This environment is adapted from the original MuJoCo Ant environment (Todorov et al., 2012)."
Dataset Splits: Yes
    "For Basic Karel, we have a train and test dataset of 24000 and 2400 tasks, respectively." "For Point Mass-S, we constructed a separate test set of 100 tasks by uniformly picking tasks from the task space." "We construct the training pool of tasks by uniformly sampling 100 tasks over the space of possible tasks."
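Constructing pools by "uniformly sampling tasks over the space of possible tasks" can be illustrated with a small stdlib sketch; the two-dimensional context box and its bounds below are placeholders, not the paper's actual task parameterization.

```python
import random

def sample_task_pool(bounds, n, seed=0):
    """Uniformly sample n task contexts from a box given by per-dimension
    (low, high) bounds. Bounds and seeds are illustrative placeholders."""
    rng = random.Random(seed)
    return [tuple(rng.uniform(lo, hi) for lo, hi in bounds) for _ in range(n)]

# A hypothetical 2-D context space with made-up bounds; separate seeds
# keep the training pool and the held-out test set disjoint in practice.
train_pool = sample_task_pool([(-4.0, 4.0), (0.5, 8.0)], 100, seed=0)
test_pool = sample_task_pool([(-4.0, 4.0), (0.5, 8.0)], 100, seed=1)
```

Fixing the seeds makes the sampled pools reproducible across runs, which is the point of reporting the split construction.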
Hardware Specification: Yes
    "All the experiments were conducted on a cluster of machines with CPUs of model Intel Xeon Gold 6134M CPU @ 3.20GHz."
Software Dependencies: No
    "Throughout all the experiments, we use the PPO method from Stable-Baselines3 library for policy optimization (Schulman et al., 2017; Raffin et al., 2021)." The version of Stable-Baselines3 is not provided.
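Because the Stable-Baselines3 version is not reported, anyone reproducing the experiments has to record it themselves. A minimal stdlib helper for logging installed versions might look like the following; the package names listed are examples of what a PPO setup typically depends on, not the paper's declared dependencies.

```python
from importlib.metadata import version, PackageNotFoundError

def installed_version(package):
    """Return the installed version string of a package, or a marker
    string if the package is absent from the current environment."""
    try:
        return version(package)
    except PackageNotFoundError:
        return "not installed"

# Example packages to log alongside experiment results; the names are
# illustrative of a typical Stable-Baselines3 / PPO environment.
for pkg in ("stable-baselines3", "torch", "gymnasium"):
    print(f"{pkg}: {installed_version(pkg)}")
```

Writing this report into the experiment's output directory pins the software environment without relying on the paper to state it.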
Experiment Setup: Yes
    "In Figure 6, we report the PPO hyperparameters used in the experiments. For each environment, all the hyperparameters are consistent across all the different curriculum strategies." "Figure 6: Different hyperparameters of the PPO method used in the experiments for each environment."