Synthesizing world models for bilevel planning

Authors: Zergham Ahmed, Joshua B. Tenenbaum, Chris Bates, Samuel J. Gershman

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We demonstrate that this approach can be successfully applied to diverse and challenging grid-world games, where approaches based on directly synthesizing a policy perform poorly. Ablation studies demonstrate the benefits of using hierarchical abstractions." (Section 4, Experiments)
Researcher Affiliation | Academia | "Zergham Ahmed (EMAIL), Department of Computer Science, Harvard University; Joshua B. Tenenbaum (EMAIL), Department of Brain and Cognitive Sciences, MIT; Christopher J. Bates (EMAIL), Institute for Human and Machine Cognition, Harvard University; Samuel J. Gershman (EMAIL), Department of Psychology and Center for Brain Science, Harvard University"
Pseudocode | Yes | "Algorithm 1 Theory Coder. Input: PDDL domain file D, PDDL problem file P, LLM, initial state s0, action space A. Output: low-level plan π = a1, a2, ..., aN"
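The bilevel structure named in Algorithm 1 (an abstract plan refined into low-level actions) can be sketched minimally. Everything below is an illustrative assumption: the PDDL planner is replaced by a fixed sequence of subgoal predicates, and the synthesized world model is a hand-written grid transition function.

```python
# Minimal sketch of a bilevel planning loop, under stated assumptions:
# each high-level subgoal is refined into low-level actions by BFS over
# a toy 5x5 grid-world transition model.
from collections import deque

def transition(state, action):
    """Hypothetical low-level world model: move an agent on a 5x5 grid."""
    x, y = state
    dx, dy = {"up": (0, -1), "down": (0, 1),
              "left": (-1, 0), "right": (1, 0)}[action]
    nx, ny = x + dx, y + dy
    if 0 <= nx < 5 and 0 <= ny < 5:   # moves off the grid are no-ops
        return (nx, ny)
    return state

def refine(subgoal, state, actions):
    """BFS over low-level actions until the subgoal predicate holds."""
    frontier = deque([(state, [])])
    visited = {state}
    while frontier:
        s, plan = frontier.popleft()
        if subgoal(s):
            return s, plan
        for a in actions:
            s2 = transition(s, a)
            if s2 not in visited:
                visited.add(s2)
                frontier.append((s2, plan + [a]))
    raise ValueError("subgoal unreachable under the current world model")

def bilevel_plan(high_level_plan, s0, actions):
    """Concatenate the low-level segments that achieve each abstract subgoal."""
    state, low_level = s0, []
    for subgoal in high_level_plan:
        state, segment = refine(subgoal, state, actions)
        low_level += segment
    return low_level

# Usage: two abstract subgoals, refined into a sequence of grid moves.
ACTIONS = ["up", "down", "left", "right"]
plan = bilevel_plan([lambda s: s == (4, 0), lambda s: s == (4, 4)],
                    (0, 0), ACTIONS)
```

In the paper's setting the subgoals would come from a PDDL planner over an LLM-written domain file, and `transition` would be an LLM-synthesized Python world model rather than a hard-coded one.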
Open Source Code | Yes | "Code is available at https://github.com/ZerghamAhmed/TheoryCoder"
Open Datasets | Yes | "We use games suitable for testing LLM-style agents that have text-based state representations available, including Baba Is You (Oy, 2019) and BabyAI (Chevalier-Boisvert et al., 2019). We use the Keke Competition (Charity & Togelius, 2022) version of Baba Is You. ... BabyAI is built on top of Minigrid (Chevalier-Boisvert et al., 2018; 2023)."
Dataset Splits | No | The paper evaluates on levels within game environments (Baba Is You, BabyAI, Sokoban). It mentions training agents with a "curriculum" (Table 7) and gathering an initial "replay buffer of experience" (Section 3.4), but it does not provide counts or percentages for training, validation, or test splits of any dataset in the conventional sense.
Hardware Specification | No | The paper mentions using GPT-4o and GPT-o1 to generate and refine programs via API calls, but it does not specify the hardware (e.g., GPU/CPU models, memory) used to run Theory Coder's own components (such as the PDDL planner or BFS) or the LLM inference itself.
Software Dependencies | No | The paper mentions using Python and the Fast Downward PDDL planner, but it does not give version numbers for these components. It specifies "PDDL 1.2", but this refers to the language specification, not an installable software dependency.
Experiment Setup | Yes | "We set the temperature to 0.7 and all other hyperparameters are set to default in the API call. ... We use a maximum depth of 30 for their MCTS planner as well as a max LLM synthesis request budget of 50."
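The stated API configuration can be captured as a small request fragment. Only the temperature of 0.7 is reported in the paper; the model name, message content, and dictionary layout below are illustrative assumptions.

```python
# Hypothetical sketch of the LLM synthesis request parameters.
# Only temperature=0.7 is stated; everything else follows API defaults
# or is an assumption for illustration.
request = {
    "model": "gpt-4o",      # the paper also uses GPT-o1 for refinement
    "temperature": 0.7,     # reported value; other hyperparameters default
    "messages": [{"role": "user",
                  "content": "<world-model synthesis prompt>"}],
}
```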