LLMs Can Plan Only If We Tell Them
Authors: Bilgehan Sel, Ruoxi Jia, Ming Jin
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through extensive experimentation and analysis, we seek to answer two critical questions: ... Our experiments demonstrate the effectiveness of the AoT+ methodology across a range of challenging planning and reasoning tasks. Table 3 presents a comprehensive comparison of our approach against other methods, including Chain-of-Thought (CoT), LLM-Modulo, and with various LLM architectures. |
| Researcher Affiliation | Academia | Bilgehan Sel Department of ECE Virginia Tech Blacksburg, VA 24061, USA EMAIL; Ruoxi Jia Department of ECE Virginia Tech Blacksburg, VA 24061, USA EMAIL; Ming Jin Department of ECE Virginia Tech Blacksburg, VA 24061, USA EMAIL |
| Pseudocode | No | The paper describes prompting methodologies and provides examples of prompts in Appendix B, but does not contain structured pseudocode or algorithm blocks with numbered steps or code-like formatting for its proposed methods (AoT+). |
| Open Source Code | No | The paper states "The supplementary material includes implementation code for Tree-Planner in the Blocksworld environment." in Appendix A.5, referring to a baseline method. However, there is no explicit statement or link providing access to the source code for the proposed AoT+ methodology itself within the paper. |
| Open Datasets | Yes | Our problem setups closely follow those in Valmeekam et al. (2023) for Blocksworld and Logistics, and Qiu et al. (2023) for ACRE and List Functions. ... The List Functions dataset (Rule, 2020) evaluates an LLM’s ability to induce rules that transform input lists into output lists. ... The Abstract Causal REasoning (ACRE) dataset (Zhang et al., 2021) tests an LLM’s ability to identify causal relationships. |
| Dataset Splits | No | The paper states, "Our problem setups closely follow those in Valmeekam et al. (2023) for Blocksworld and Logistics, and Qiu et al. (2023) for ACRE and List Functions." While it refers to other papers for problem setups, it does not explicitly provide specific training/test/validation dataset split information within its own text. |
| Hardware Specification | No | The paper mentions using various Large Language Models (LLMs) such as "GPT-4", "GPT-4o", "Claude", "Gemini 1.5", and "LLaMA 3.1" for experiments, and discusses their "computational capacity". However, it does not provide specific details about the hardware (e.g., GPU models, CPU types, memory) used to run these experiments. |
| Software Dependencies | No | The paper mentions using "PDDL" to formalize instances and check validity for planning problems, and refers to various LLM architectures like "GPT-4", "Claude", "Gemini", and "LLaMA 3.1". However, it does not provide specific version numbers for any ancillary software dependencies (e.g., programming languages, libraries, frameworks, or solvers) used for implementing their methodology. |
| Experiment Setup | Yes | For Self-Refine (Madaan et al., 2024). We adhered to their original hyperparameters (Temperature = 0.7) but extended the maximum iterations from 4 to 10 to ensure fair comparison with our other baseline, Tree-Planner (Hu et al., 2023), which employs a maximum of 10 corrections (SF = 10). ... We maintained the structural hyperparameters from the original implementation [of Tree-Planner], with N = 25 initial samples and a maximum of 10 error corrections. |
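The baseline hyperparameters quoted in the Experiment Setup row can be collected into a single configuration for reference. This is a minimal illustrative sketch, not code from the paper; the dictionary structure and all key names are hypothetical, while the numeric values are taken directly from the quoted text.

```python
# Hypothetical config sketch of the reported baseline hyperparameters.
# Values come from the Experiment Setup quote; structure and names are ours.
baseline_config = {
    "self_refine": {
        "temperature": 0.7,    # original Self-Refine setting (Madaan et al., 2024)
        "max_iterations": 10,  # extended from 4 to 10 for fair comparison
    },
    "tree_planner": {
        "initial_samples": 25,       # N = 25 initial samples
        "max_error_corrections": 10, # maximum of 10 corrections
    },
}

for method, params in baseline_config.items():
    print(method, params)
```

Laying the settings out this way makes the stated fairness argument easy to check: both baselines are capped at the same number of correction/refinement rounds (10).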