Subgoal-Guided Policy Heuristic Search with Learned Subgoals

Authors: Jake Tuero, Michael Buro, Levi Lelis

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In experiments, we demonstrate the sample efficiency our method enables in that it requires substantially fewer node expansions to learn effective policies than other search algorithms trained with the Bootstrap algorithm in a variety of problem domains. We also show that policy tree search algorithms using our subgoal-based policy can learn how to solve problems from domains that HIPS-ε cannot solve.
Researcher Affiliation | Academia | 1 Department of Computing Science, University of Alberta, Edmonton, Canada. 2 Alberta Machine Intelligence Institute (Amii), Edmonton, Canada. Correspondence to: Jake Tuero <EMAIL>.
Pseudocode | Yes | See Appendix C for its pseudocode.
Open Source Code | Yes | The codebase [2] is compiled using the GNU Compiler Collection version 13.3.0, and uses the PyTorch 2.4 C++ frontend (Paszke et al., 2019). [2] https://github.com/tuero/subgoal-guided-policy-search
Open Datasets | Yes | Craft World: A 14×14 room with various raw materials and workbenches (Andreas et al., 2017). We generate problems with the open-source level generator [1], following the procedure detailed by Andreas et al. (2017). [1] https://github.com/jacobandreas/psketch/tree/master ... Sokoban: ... We use the Boxoban training and test problems (Guez et al., 2018). ... Sokoban uses the Boxoban [4] problems. [4] https://github.com/deepmind/boxoban-levels/
Dataset Splits | Yes | Every domain has a disjoint set of 10,000 problem instances for training, 1,000 for validation, and 100 for testing.
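The disjoint train/validation/test split described above can be sketched as follows. This is a minimal illustration using the stated split sizes (10,000 / 1,000 / 100); the function name, seed, and the use of a shuffled id list are assumptions, not details from the paper.

```python
import random

TRAIN, VALID, TEST = 10_000, 1_000, 100  # split sizes reported in the paper

def make_splits(instance_ids, seed=0):
    """Partition problem-instance ids into disjoint train/valid/test sets."""
    ids = list(instance_ids)
    assert len(ids) >= TRAIN + VALID + TEST, "not enough instances to split"
    random.Random(seed).shuffle(ids)  # fixed seed for a reproducible split
    train = ids[:TRAIN]
    valid = ids[TRAIN:TRAIN + VALID]
    test = ids[TRAIN + VALID:TRAIN + VALID + TEST]
    return train, valid, test

train, valid, test = make_splits(range(TRAIN + VALID + TEST))
print(len(train), len(valid), len(test))  # → 10000 1000 100
```

Because the three slices come from one shuffled list, the sets are disjoint by construction.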
Hardware Specification | Yes | All experiments were conducted on an Intel i9-7960X and Nvidia 3090, with 128GB of system memory running Ubuntu 24.04.
Software Dependencies | Yes | The codebase is compiled using the GNU Compiler Collection version 13.3.0, and uses the PyTorch 2.4 C++ frontend (Paszke et al., 2019).
Experiment Setup | Yes | We use the Adam optimizer (Kingma, 2014), with learning rate of 3E-4 and L2-regularization of 1E-4. The policy and heuristic networks for PHS*(π), Levin TS(π), PHS*(πSG), and Levin TS(πSG) all use 128 ResNet channels, with PHS*(πSG) and Levin TS(πSG) using half the number of blocks (4 versus 8) because they have both a low-level and a high-level policy. The VQVAE subgoal generator uses a codebook size of 4, a codebook dimension of size 128, and β = 0.25.
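The VQVAE hyperparameters quoted above (codebook size 4, codebook dimension 128, β = 0.25) correspond to the standard vector-quantization step: the encoder output is snapped to its nearest codebook entry, and β weights the commitment term of the loss. The sketch below is a dependency-free illustration of that step under the stated hyperparameters; the function names are hypothetical and this is not the paper's implementation, which uses the PyTorch C++ frontend.

```python
import random

CODEBOOK_SIZE = 4  # number of discrete subgoal codes (from the paper)
CODE_DIM = 128     # codebook embedding dimension (from the paper)
BETA = 0.25        # commitment-loss weight β (from the paper)

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def quantize(z, codebook):
    """Return the index and embedding of the codebook entry nearest to z."""
    idx = min(range(len(codebook)), key=lambda k: sq_dist(z, codebook[k]))
    return idx, codebook[idx]

def vq_loss(z, e, beta=BETA):
    """Codebook + commitment terms of the VQ-VAE objective.

    During training the first term carries a stop-gradient on z and the
    second (scaled by beta) a stop-gradient on e; here we only compute
    the scalar value, so both reduce to the same squared distance.
    """
    d = sq_dist(z, e)
    return d + beta * d

random.seed(0)
codebook = [[random.gauss(0, 1) for _ in range(CODE_DIM)]
            for _ in range(CODEBOOK_SIZE)]
z = [random.gauss(0, 1) for _ in range(CODE_DIM)]  # stand-in encoder output
idx, e = quantize(z, codebook)
print(idx, round(vq_loss(z, e), 3))
```

A small codebook (size 4 here) keeps the high-level policy's discrete subgoal space compact, which is consistent with the paper pairing it with a relatively large per-code embedding (dimension 128).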