Subgoal-Guided Policy Heuristic Search with Learned Subgoals
Authors: Jake Tuero, Michael Buro, Levi Lelis
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In experiments, we demonstrate the sample efficiency of our method: it requires substantially fewer node expansions to learn effective policies than other search algorithms trained with the Bootstrap algorithm in a variety of problem domains. We also show that policy tree search algorithms using our subgoal-based policy can learn how to solve problems from domains that HIPS-ε cannot solve. |
| Researcher Affiliation | Academia | 1Department of Computing Science, University of Alberta, Edmonton, Canada 2Alberta Machine Intelligence Institute (Amii), Edmonton, Canada. Correspondence to: Jake Tuero <EMAIL>. |
| Pseudocode | Yes | See Appendix C for its pseudocode. |
| Open Source Code | Yes | The codebase (https://github.com/tuero/subgoal-guided-policy-search) is compiled using the GNU Compiler Collection version 13.3.0, and uses the PyTorch 2.4 C++ frontend (Paszke et al., 2019). |
| Open Datasets | Yes | Craft World: A 14×14 room with various raw materials and workbenches (Andreas et al., 2017). We generate problems with the open-source level generator (https://github.com/jacobandreas/psketch/tree/master) following the procedure detailed by Andreas et al. (2017). ... Sokoban: ... We use the Boxoban training and test problems (Guez et al., 2018). ... Sokoban uses the Boxoban problems (https://github.com/deepmind/boxoban-levels/). |
| Dataset Splits | Yes | Every domain has a disjoint set of 10,000 problem instances to train, 1,000 as validation, and 100 in the test set. |
| Hardware Specification | Yes | All experiments were conducted on an Intel i9-7960X and Nvidia 3090, with 128GB of system memory running Ubuntu 24.04. |
| Software Dependencies | Yes | The codebase is compiled using the GNU Compiler Collection version 13.3.0, and uses the PyTorch 2.4 C++ frontend (Paszke et al., 2019). |
| Experiment Setup | Yes | We use the Adam optimizer (Kingma, 2014), with a learning rate of 3E-4 and L2-regularization of 1E-4. The policy and heuristic networks for PHS*(π), LevinTS(π), PHS*(πSG), and LevinTS(πSG) all use 128 ResNet channels, with PHS*(πSG) and LevinTS(πSG) using half the number of blocks (4 versus 8) because they each have both a low-level and a high-level policy. The VQVAE subgoal generator uses a codebook size of 4, a codebook dimension of size 128, and β = 0.25. |
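The reported VQVAE hyperparameters (codebook size 4, codebook dimension 128, commitment coefficient β = 0.25) can be illustrated with a minimal NumPy sketch of the vector-quantization step. This is not the paper's C++ codebase; all names here are illustrative assumptions, and the straight-through gradient estimator is omitted.

```python
import numpy as np

# Hyperparameters as reported in the experiment setup.
CODEBOOK_SIZE = 4    # number of discrete subgoal codes
CODEBOOK_DIM = 128   # dimension of each codebook entry
BETA = 0.25          # commitment-loss coefficient

rng = np.random.default_rng(0)
codebook = rng.normal(size=(CODEBOOK_SIZE, CODEBOOK_DIM))

def quantize(z_e):
    """Map an encoder output z_e (shape [CODEBOOK_DIM]) to its nearest
    codebook entry, returning the quantized vector, its index, and the
    VQ-VAE codebook + commitment loss terms (scalar)."""
    dists = np.linalg.norm(codebook - z_e, axis=1)
    idx = int(np.argmin(dists))
    z_q = codebook[idx]
    sq_dist = float(np.sum((z_q - z_e) ** 2))
    # In training, the codebook term uses stop-gradient on z_e and the
    # commitment term uses stop-gradient on z_q; numerically both equal
    # the squared distance, so only the weighting differs here.
    return z_q, idx, sq_dist + BETA * sq_dist

z_e = rng.normal(size=CODEBOOK_DIM)
z_q, idx, vq_loss = quantize(z_e)
```

With only 4 codes, each high-level decision selects one of four learned subgoal embeddings, which keeps the high-level policy's action space small.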