Autonomous Option Invention for Continual Hierarchical Reinforcement Learning and Planning
Authors: Rashmeet Kaur Nayyar, Siddharth Srivastava
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical results demonstrate that the resulting approach effectively learns and transfers abstract knowledge across problem instances, achieving superior sample efficiency compared to state-of-the-art methods. Extensive empirical evaluation across a variety of challenging domains with continuous/hybrid states and discrete actions demonstrates that our approach substantially surpasses SOTA RL baselines in sample efficiency within continual RL settings. |
| Researcher Affiliation | Academia | Autonomous Agents and Intelligent Robots Lab, School of Computing and Augmented Intelligence, Arizona State University, Tempe, AZ, USA EMAIL |
| Pseudocode | Yes | Algorithm 1: CHiRP algorithm; Algorithm 2: Invention of Abstract Options |
| Open Source Code | Yes | We evaluated CHiRP on a diverse suite of challenging domains in a continual RL setting. Full details about the used domains and hyperparameters are provided in the extended version of our paper (Nayyar and Srivastava 2024). https://github.com/AAIR-lab/CHiRP |
| Open Datasets | Yes | Domains. For our evaluation, we compiled a suite of test domains for continual RL that are amenable to hierarchical decomposition and challenging for SOTA methods. We then created versions of these problems that are significantly larger than prior investigations to evaluate whether the presented approaches are able to push the limits of scope and scalability of continual RL. Our investigation focused on stochastic versions of the following domains with continuous or hybrid states: (1) Maze World (Ramesh, Tomar, and Ravindran 2019): An agent needs to navigate through randomly placed wall obstacles to reach the goal; (2) Four Rooms World (Sutton, Precup, and Singh 1999): An agent must move within and between rooms via hallways to reach the goal; (3) Office World (Icarte et al. 2018): An agent needs to collect coffee and mail from different rooms and deliver them to an office; (4) Taxi World (Dietterich 2000): A taxi needs to pick up a passenger from its pickup location and drop them off at their destination; (5) Minecraft (James, Rosman, and Konidaris 2022): An agent must find and mine relevant resources, build intermediate tools, and use them to craft an iron or stone axe. |
| Dataset Splits | No | For each domain, 20 tasks are randomly sampled sequentially from a distribution. Each approach is provided a fixed budget of H timesteps per task before moving on to the next task. Due to stochasticity and lack of transition models, a task is considered solved if the agent achieves the goal 90% of the time among 100 independent evaluation runs of the learned policy. We report the fraction of tasks solved within the total allocated timesteps for each approach. The reported timesteps include all the interactions with the environment used for learning state abstractions, option endpoints, and option policies. Results are averaged, and standard deviations are computed from 10 independent trials across the entire problem stream. The paper describes the generation and evaluation of tasks in a continual RL setting rather than providing specific training/validation/test splits of a fixed dataset. While tasks are randomly sampled, this is not a traditional dataset split. |
| Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory amounts) are provided in the paper for running the experiments. The paper mentions general computational aspects related to hyper-parameters but not specific hardware. |
| Software Dependencies | No | The paper mentions that the code is open-source via a GitHub link, which implies software dependencies are available in the repository. However, it does not explicitly list any specific software or library names with their version numbers in the main text. |
| Experiment Setup | Yes | Hyperparameters. A key strength of CHiRP over baselines is that it requires only five additional hyperparameters beyond standard RL parameters (e.g., decay, learning rate), unlike SOTA DRL methods that need extensive tuning and significant effort in network architecture design. Throughout our experiments, we intuitively set δthre = 0 and σthre = 1 to minimize hyperparameter tuning. These values are robust across domains, preventing options from being too small or numerous. We use a limited set of values for the kcap, sfactor, and emax parameters across different domains to adaptively control the training of an option's policy and CAT. All parameters are set to the same values across a continual stream of tasks. Details on the used hyperparameters for CHiRP and the baselines are provided in the extended version. |
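The evaluation protocol quoted above (a task counts as solved if the learned policy reaches the goal in at least 90% of 100 independent evaluation runs, with the fraction of 20 sequentially sampled tasks solved as the reported metric) can be sketched as follows. This is a minimal illustration, not the paper's code; the function names (`is_task_solved`, `evaluate_stream`) and the representation of a policy as a callable returning success/failure are assumptions for the sketch.

```python
def is_task_solved(run_policy, n_runs=100, success_threshold=0.90):
    """Run `n_runs` independent evaluation rollouts of a learned policy.

    `run_policy` is a hypothetical zero-argument callable returning True
    if the rollout reaches the goal. The task counts as solved if the
    empirical success rate meets the 90% threshold from the paper.
    """
    successes = sum(bool(run_policy()) for _ in range(n_runs))
    return successes / n_runs >= success_threshold

def evaluate_stream(task_policies, n_tasks=20):
    """Fraction of tasks solved across a sequential stream of tasks.

    `task_policies` holds one evaluation callable per task; the paper
    samples 20 tasks per trial and reports the solved fraction.
    """
    solved = sum(is_task_solved(p) for p in task_policies[:n_tasks])
    return solved / n_tasks
```

In the paper this fraction is then averaged over 10 independent trials of the whole problem stream, with standard deviations computed across trials.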