Reusable Options through Gradient-based Meta Learning
Authors: David Kuric, Herke van Hoof
TMLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In order to evaluate our method empirically and show the benefits of learned terminations as well as end-to-end learning with gradient-based meta-learning, we perform experiments in Taxi and Ant Maze domains. |
| Researcher Affiliation | Academia | David Kuric, AMLab, University of Amsterdam; Herke van Hoof, AMLab, University of Amsterdam |
| Pseudocode | Yes | Algorithm 1: Fast Adaptation of Modular Policies (FAMP) |
| Open Source Code | Yes | Our code is publicly available at https://github.com/Kuroo/FAMP. |
| Open Datasets | Yes | In the first set of experiments we use a modified Taxi environment (Dietterich, 2000). This environment is commonly used in works on hierarchical learning and options (Dietterich, 2000; Igl et al., 2020) and allows one to create many different tasks with shared parts to test the reusability of learned options. In the second experiment, we demonstrate the applicability of our method to more complex environments with continuous state and action spaces and perform similar ablations. We use the Ant Maze domain introduced by Frans et al. (2018), whose tasks are shown in Figure 5. |
| Dataset Splits | Yes | For Taxi, we use 48 combinations as training tasks and 12 as test tasks. For Ant Maze, we use 9 tasks for training and 4 as test tasks. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, processor types, or memory amounts) used for running its experiments. |
| Software Dependencies | No | The paper mentions existing implementations for baselines and that hyperparameters and details are in Appendix A, but does not list specific versions of software dependencies (e.g., Python, PyTorch, TensorFlow versions or other libraries). |
| Experiment Setup | Yes | The reward is 2 for reaching the goal and -0.1 per step otherwise. Episodes terminate if they take longer than 1500 timesteps. We use tabular representations for the policy over options, terminations, and subpolicies. We used two hidden layers of 64 nodes to represent the components of hierarchical policies and the policy of PPO. For MAML, we increased the layer sizes to 128. In addition to the aforementioned baselines, we have also trained an agent with RL2 (Duan et al., 2016b) on these tasks. Because its performance was comparable to MAML, which is another meta-learning algorithm without options that is directly related to our method, we decided to omit RL2 from the plots to keep them uncluttered. Its meta-training curve can be found in Appendix B. For all baselines, we used existing implementations for training and evaluation. Exact hyperparameters and details can be found in Appendix A. |
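The setup above relies on gradient-based meta-learning in the style of MAML: an initialization is meta-trained so that a small number of inner gradient steps adapts it to a new task. The sketch below is an illustrative first-order toy version of that loop on a quadratic per-task loss, not the paper's FAMP algorithm or its RL objective; the loss, learning rates, and task distribution are all assumptions chosen for clarity.

```python
import numpy as np

# First-order MAML-style sketch (toy example, NOT the paper's FAMP):
# meta-learn an initialization theta so that one inner gradient step on a
# task-specific quadratic loss L_t(theta) = ||theta - c_t||^2 adapts well.

def loss(theta, c):
    """Task loss: squared distance to the task's optimum c."""
    return float(np.sum((theta - c) ** 2))

def loss_grad(theta, c):
    """Gradient of the task loss with respect to theta."""
    return 2.0 * (theta - c)

def meta_train(task_centers, inner_lr=0.1, outer_lr=0.05, steps=500, seed=0):
    rng = np.random.default_rng(seed)
    theta = rng.normal(size=task_centers[0].shape)
    for _ in range(steps):
        c = task_centers[rng.integers(len(task_centers))]  # sample a task
        adapted = theta - inner_lr * loss_grad(theta, c)   # inner adaptation step
        # First-order outer update: gradient evaluated at the adapted parameters.
        theta = theta - outer_lr * loss_grad(adapted, c)
    return theta

tasks = [np.array([1.0, 0.0]), np.array([-1.0, 0.0])]
theta0 = meta_train(tasks)
```

After meta-training, a single inner step from `theta0` reduces the loss on every task, which is the property the meta-objective optimizes for; the paper applies the same idea end-to-end to option subpolicies, terminations, and the policy over options, with policy-gradient losses in place of the quadratic.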