Reusable Options through Gradient-based Meta Learning
Authors: David Kuric, Herke van Hoof
TMLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In order to evaluate our method empirically and show the benefits of learned terminations as well as end-to-end learning with gradient-based meta-learning, we perform experiments in Taxi and Ant Maze domains. |
| Researcher Affiliation | Academia | David Kuric, AMLab, University of Amsterdam; Herke van Hoof, AMLab, University of Amsterdam |
| Pseudocode | Yes | Algorithm 1: Fast Adaptation of Modular Policies (FAMP) |
| Open Source Code | Yes | Our code is publicly available at https://github.com/Kuroo/FAMP. |
| Open Datasets | Yes | In the first set of experiments we use a modified Taxi environment (Dietterich, 2000). This environment is commonly used in works on hierarchical learning and options (Dietterich, 2000; Igl et al., 2020) and allows one to create many different tasks with shared parts to test the reusability of learned options. In the second experiment, we demonstrate the applicability of our method to more complex environments with continuous state and action spaces and perform similar ablations. We use the Ant Maze domain introduced by Frans et al. (2018), whose tasks are shown in Figure 5. |
| Dataset Splits | Yes | For Taxi, we use 48 combinations as training tasks and 12 as test tasks. For Ant Maze, we use 9 tasks for training and 4 as test tasks. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, processor types, or memory amounts) used for running its experiments. |
| Software Dependencies | No | The paper mentions existing implementations for baselines and that hyperparameters and details are in Appendix A, but does not list specific versions of software dependencies (e.g., Python, PyTorch, TensorFlow versions or other libraries). |
| Experiment Setup | Yes | The reward is 2 for reaching the goal and -0.1 per step otherwise. Episodes terminate if they take longer than 1500 timesteps. We use tabular representations for the policy over options, terminations, and subpolicies. We used two hidden layers of 64 nodes to represent the components of hierarchical policies and the policy of PPO. For MAML, we increased the layer sizes to 128. In addition to the aforementioned baselines, we have also trained an agent with RL2 (Duan et al., 2016b) on these tasks. Because its performance was comparable to MAML, which is another meta-learning algorithm without options that is directly related to our method, we decided to omit RL2 from the plots to keep them uncluttered. Its meta-training curve can be found in Appendix B. For all baselines, we used existing implementations for training and evaluation. Exact hyperparameters and details can be found in Appendix A. |
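The setup above relies on gradient-based meta-learning in the style of MAML: an initialization is meta-trained so that a small number of inner gradient steps adapts it to a new task. The sketch below is an illustrative first-order toy version of that loop on a quadratic per-task loss, not the paper's FAMP algorithm or its RL objective; the loss, learning rates, and task distribution are all assumptions chosen for clarity.

```python
import numpy as np

# First-order MAML-style sketch (toy example, NOT the paper's FAMP):
# meta-learn an initialization theta so that one inner gradient step on a
# task-specific quadratic loss L_t(theta) = ||theta - c_t||^2 adapts well.

def loss(theta, c):
    """Task loss: squared distance to the task's optimum c."""
    return float(np.sum((theta - c) ** 2))

def loss_grad(theta, c):
    """Gradient of the task loss with respect to theta."""
    return 2.0 * (theta - c)

def meta_train(task_centers, inner_lr=0.1, outer_lr=0.05, steps=500, seed=0):
    rng = np.random.default_rng(seed)
    theta = rng.normal(size=task_centers[0].shape)
    for _ in range(steps):
        c = task_centers[rng.integers(len(task_centers))]  # sample a task
        adapted = theta - inner_lr * loss_grad(theta, c)   # inner adaptation step
        # First-order outer update: gradient evaluated at the adapted parameters.
        theta = theta - outer_lr * loss_grad(adapted, c)
    return theta

tasks = [np.array([1.0, 0.0]), np.array([-1.0, 0.0])]
theta0 = meta_train(tasks)
```

After meta-training, a single inner step from `theta0` reduces the loss on every task, which is the property the meta-objective optimizes for; the paper applies the same idea end-to-end to option subpolicies, terminations, and the policy over options, with policy-gradient losses in place of the quadratic.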