Proximal Curriculum for Reinforcement Learning Agents
Authors: Georgios Tzannetos, Bárbara Gomes Ribeiro, Parameswaran Kamalaruban, Adish Singla
TMLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results on a variety of domains demonstrate the effectiveness of our curriculum strategy over state-of-the-art baselines in accelerating the training process of deep RL agents. In this section, we evaluate the effectiveness of our curriculum strategies on a variety of domains w.r.t. the uniform performance of the trained RL agent over the training pool of tasks. |
| Researcher Affiliation | Academia | Georgios Tzannetos, Max Planck Institute for Software Systems; Bárbara Gomes Ribeiro, Max Planck Institute for Software Systems; Parameswaran Kamalaruban, The Alan Turing Institute; Adish Singla, Max Planck Institute for Software Systems |
| Pseudocode | Yes | Algorithm 1 RL Agent Training as Interaction between Teacher-Student Components. Algorithm 2 in the appendix provides a complete pseudo-code for the RL agent training with ProCuRL-val in this general setting. |
| Open Source Code | Yes | 1Github repo: https://github.com/machine-teaching-group/tmlr2023_proximal-curriculum-rl. |
| Open Datasets | Yes | Based on the work of Klink et al. (2020b), we consider a contextual Point Mass environment... This environment is the same used in the work of Klink et al. (2020b)... This environment is adapted from the original MuJoCo Ant environment (Todorov et al., 2012). |
| Dataset Splits | Yes | For Basic Karel, we have a train and test dataset of 24000 and 2400 tasks, respectively. For Point Mass-S, we constructed a separate test set of 100 tasks by uniformly picking tasks from the task space. We construct the training pool of tasks by uniformly sampling 100 tasks over the space of possible tasks. |
| Hardware Specification | Yes | All the experiments were conducted on a cluster of machines with CPUs of model Intel Xeon Gold 6134M CPU @ 3.20GHz. |
| Software Dependencies | No | Throughout all the experiments, we use the PPO method from Stable-Baselines3 library for policy optimization (Schulman et al., 2017; Raffin et al., 2021). The version of Stable-Baselines3 is not provided. |
| Experiment Setup | Yes | In Figure 6, we report the PPO hyperparameters used in the experiments. For each environment, all the hyperparameters are consistent across all the different curriculum strategies. Figure 6: Different hyperparameters of the PPO method used in the experiments for each environment. |
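The Software Dependencies row flags that the paper names Stable-Baselines3 (PPO) but not its version. One practical way a reproduction can close that gap is to record the installed versions of the key packages alongside the results. Below is a minimal sketch using only Python's standard-library `importlib.metadata`; the helper name and the package list are illustrative, not from the paper's release.

```python
from importlib import metadata

# Hypothetical helper (not part of the paper's code release): map each
# package name to its installed version, or None if it is not installed.
def dependency_versions(packages):
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions

if __name__ == "__main__":
    # The paper uses PPO from Stable-Baselines3 without stating a version;
    # logging the versions in use makes the setup reproducible.
    print(dependency_versions(["stable-baselines3", "torch", "gym"]))
```

Emitting this mapping into the experiment logs (or a pinned `requirements.txt`) would resolve the "No" recorded for this variable.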