Effectively Learning Initiation Sets in Hierarchical Reinforcement Learning
Authors: Akhil Bagaria, Ben Abbatematteo, Omer Gottesman, Matt Corsaro, Sreehari Rammohan, George Konidaris
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show that our method learns higher-quality initiation sets faster than existing methods (in MINIGRID and MONTEZUMA'S REVENGE), can automatically discover promising grasps for robot manipulation (in ROBOSUITE), and improves the performance of a state-of-the-art option discovery method in a challenging maze navigation task in MuJoCo. |
| Researcher Affiliation | Collaboration | Akhil Bagaria, Brown University, Providence, RI, USA; Ben Abbatematteo, Brown University, Providence, RI, USA; Omer Gottesman, Amazon, New York, NY, USA; Matt Corsaro, Brown University, Providence, RI, USA; Sreehari Rammohan, Brown University, Providence, RI, USA; George Konidaris, Brown University, Providence, RI, USA. |
| Pseudocode | Yes | Algorithm 1 is the pseudocode used for the experiments described in Section 4.1; the paper also provides Algorithm 2 (Robust DSC Rollout) and Algorithm 3 (Robust DSC). |
| Open Source Code | No | The paper does not provide concrete access to source code for the methodology described in this paper, nor does it explicitly state that the code is released. |
| Open Datasets | Yes | MINIGRID-FOURROOMS [Chevalier-Boisvert et al., 2018] and the first screen of MONTEZUMA'S REVENGE [Bellemare et al., 2013]. We use three constrained manipulation tasks in ROBOSUITE [Zhu et al., 2020]. We use the ANT MEDIUM MAZE environment [Fu et al., 2020, Todorov et al., 2012]. |
| Dataset Splits | Yes | The agent is evaluated by rolling out the learned policy once every 10 episodes; during evaluation, the agent starts from a small region around (0, 0), while during training it starts at a location randomly sampled from the open locations in the maze. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, memory amounts, or detailed computer specifications) used for running its experiments. |
| Software Dependencies | No | Option policies are learned using Rainbow [Hessel et al., 2018] when the action-space is discrete and TD3 [Fujimoto et al., 2018] when it is continuous. ... The IVF is learned using Fitted Q-Evaluation [Le et al., 2019], prioritized experience replay [Schaul et al., 2016] and target networks [Mnih et al., 2015]. The paper lists software components but does not specify their version numbers. |
| Experiment Setup | Yes | Implementation Details. Option policies are learned using Rainbow [Hessel et al., 2018] when the action-space is discrete and TD3 [Fujimoto et al., 2018] when it is continuous. ... The IVF Q-function and initiation classifier are parameterized using neural networks that have the same architecture as the Rainbow/TD3. Each option has a gestation period of 5 [Konidaris and Barto, 2009]. ... Their hyperparameters (Tables 2 and 5) were not tuned and are either identical to the original paper implementation or borrowed from Bagaria et al. [2021a]. The bonus scale c (described in Sec 3.3) was tuned over the set {0.05, 0.1, 0.25, 0.5, 1.0}; the best-performing hyperparameters are listed in Table 3. |
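The Experiment Setup row notes that the bonus scale c was tuned by evaluating each value in {0.05, 0.1, 0.25, 0.5, 1.0}. A minimal sketch of that kind of grid sweep is below; `run_trial` is a hypothetical stand-in for a full training run (the paper does not describe its tuning harness), replaced here by a dummy scoring function so the sweep itself is runnable.

```python
def run_trial(bonus_scale: float) -> float:
    """Placeholder for one training run with the given bonus scale.

    Returns a mock final return; a dummy curve peaking at c = 0.25,
    purely for illustration (not from the paper).
    """
    return -abs(bonus_scale - 0.25)


def sweep_bonus_scale(candidates):
    """Run one trial per candidate c and pick the best-scoring value."""
    scores = {c: run_trial(c) for c in candidates}
    best_c = max(scores, key=scores.get)
    return best_c, scores


best_c, scores = sweep_bonus_scale([0.05, 0.1, 0.25, 0.5, 1.0])
print(best_c)  # with the dummy scorer above, prints 0.25
```

In practice each trial would be averaged over several random seeds before comparing candidates, since single-run RL returns are high-variance.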