Synthesizing world models for bilevel planning

Authors: Zergham Ahmed, Joshua B. Tenenbaum, Chris Bates, Samuel J. Gershman

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We demonstrate that this approach can be successfully applied to diverse and challenging grid-world games, where approaches based on directly synthesizing a policy perform poorly. Ablation studies demonstrate the benefits of using hierarchical abstractions." (Section 4, Experiments)
Researcher Affiliation | Academia | "Zergham Ahmed (EMAIL), Department of Computer Science, Harvard University; Joshua B. Tenenbaum (EMAIL), Department of Brain and Cognitive Sciences, MIT; Christopher J. Bates (EMAIL), Institute for Human and Machine Cognition, Harvard University; Samuel J. Gershman (EMAIL), Department of Psychology and Center for Brain Science, Harvard University"
Pseudocode | Yes | "Algorithm 1 Theory Coder. Input: PDDL domain file D, PDDL problem file P, LLM, initial state s0, action space A. Output: low-level plan π = a1, a2, ..., aN"
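The bilevel structure named in Algorithm 1 (an abstract plan refined into low-level actions) can be sketched minimally. Everything below is an illustrative assumption: the PDDL planner is replaced by a fixed sequence of subgoal predicates, and the synthesized world model is a hand-written grid transition function.

```python
# Minimal sketch of a bilevel planning loop, under stated assumptions:
# each high-level subgoal is refined into low-level actions by BFS over
# a toy 5x5 grid-world transition model.
from collections import deque

def transition(state, action):
    """Hypothetical low-level world model: move an agent on a 5x5 grid."""
    x, y = state
    dx, dy = {"up": (0, -1), "down": (0, 1),
              "left": (-1, 0), "right": (1, 0)}[action]
    nx, ny = x + dx, y + dy
    if 0 <= nx < 5 and 0 <= ny < 5:   # moves off the grid are no-ops
        return (nx, ny)
    return state

def refine(subgoal, state, actions):
    """BFS over low-level actions until the subgoal predicate holds."""
    frontier = deque([(state, [])])
    visited = {state}
    while frontier:
        s, plan = frontier.popleft()
        if subgoal(s):
            return s, plan
        for a in actions:
            s2 = transition(s, a)
            if s2 not in visited:
                visited.add(s2)
                frontier.append((s2, plan + [a]))
    raise ValueError("subgoal unreachable under the current world model")

def bilevel_plan(high_level_plan, s0, actions):
    """Concatenate the low-level segments that achieve each abstract subgoal."""
    state, low_level = s0, []
    for subgoal in high_level_plan:
        state, segment = refine(subgoal, state, actions)
        low_level += segment
    return low_level

# Usage: two abstract subgoals, refined into a sequence of grid moves.
ACTIONS = ["up", "down", "left", "right"]
plan = bilevel_plan([lambda s: s == (4, 0), lambda s: s == (4, 4)],
                    (0, 0), ACTIONS)
```

In the paper's setting the subgoals would come from a PDDL planner over an LLM-written domain file, and `transition` would be an LLM-synthesized Python world model rather than a hard-coded one.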
Open Source Code | Yes | "Code is available at https://github.com/ZerghamAhmed/TheoryCoder"
Open Datasets | Yes | "We use games suitable for testing LLM-style agents that have text-based state representations available, including Baba Is You (Oy, 2019) and BabyAI (Chevalier-Boisvert et al., 2019). We use the Keke Competition (Charity & Togelius, 2022) version of Baba Is You. ... BabyAI is built on top of Minigrid (Chevalier-Boisvert et al., 2018; 2023)."
Dataset Splits | No | The paper evaluates on levels within game environments (Baba Is You, BabyAI, Sokoban). It mentions training agents with a "curriculum" (Table 7) and gathering an initial "replay buffer of experience" (Section 3.4), but it does not provide counts or percentages for training, validation, or test splits of any dataset in the conventional sense.
Hardware Specification | No | The paper mentions using GPT-4o and GPT-o1 to generate and refine programs via API calls, but it does not specify the hardware (e.g., GPU/CPU models, memory) used to run Theory Coder's own components (such as the PDDL planner or BFS) or the LLM inference itself.
Software Dependencies | No | The paper mentions using Python and the Fast Downward PDDL planner, but it does not give version numbers for these components. It specifies "PDDL 1.2", but this refers to the language specification, not an installable software dependency.
Experiment Setup | Yes | "We set the temperature to 0.7 and all other hyperparameters are set to default in the API call. ... We use a maximum depth of 30 for their MCTS planner as well as a max LLM synthesis request budget of 50."
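The stated API configuration can be captured as a small request fragment. Only the temperature of 0.7 is reported in the paper; the model name, message content, and dictionary layout below are illustrative assumptions.

```python
# Hypothetical sketch of the LLM synthesis request parameters.
# Only temperature=0.7 is stated; everything else follows API defaults
# or is an assumption for illustration.
request = {
    "model": "gpt-4o",      # the paper also uses GPT-o1 for refinement
    "temperature": 0.7,     # reported value; other hyperparameters default
    "messages": [{"role": "user",
                  "content": "<world-model synthesis prompt>"}],
}
```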