LLMs Can Plan Only If We Tell Them
Authors: Bilgehan Sel, Ruoxi Jia, Ming Jin
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through extensive experimentation and analysis, we seek to answer two critical questions: ... Our experiments demonstrate the effectiveness of the AoT+ methodology across a range of challenging planning and reasoning tasks. Table 3 presents a comprehensive comparison of our approach against other methods, including Chain-of-Thought (CoT), LLM-Modulo, and with various LLM architectures. |
| Researcher Affiliation | Academia | Bilgehan Sel Department of ECE Virginia Tech Blacksburg, VA 24061, USA EMAIL; Ruoxi Jia Department of ECE Virginia Tech Blacksburg, VA 24061, USA EMAIL; Ming Jin Department of ECE Virginia Tech Blacksburg, VA 24061, USA EMAIL |
| Pseudocode | No | The paper describes prompting methodologies and provides examples of prompts in Appendix B, but does not contain structured pseudocode or algorithm blocks with numbered steps or code-like formatting for its proposed methods (AoT+). |
| Open Source Code | No | The paper states "The supplementary material includes implementation code for Tree-Planner in the Blocksworld environment." in Appendix A.5, referring to a baseline method. However, there is no explicit statement or link providing access to the source code for the proposed AoT+ methodology itself within the paper. |
| Open Datasets | Yes | Our problem setups closely follow those in Valmeekam et al. (2023) for Blocksworld and Logistics, and Qiu et al. (2023) for ACRE and List Functions. ... The List Functions dataset (Rule, 2020) evaluates an LLM’s ability to induce rules that transform input lists into output lists. ... The Abstract Causal REasoning (ACRE) dataset (Zhang et al., 2021) tests an LLM’s ability to identify causal relationships. |
| Dataset Splits | No | The paper states, "Our problem setups closely follow those in Valmeekam et al. (2023) for Blocksworld and Logistics, and Qiu et al. (2023) for ACRE and List Functions." While it refers to other papers for problem setups, it does not explicitly provide specific training/test/validation dataset split information within its own text. |
| Hardware Specification | No | The paper mentions using various Large Language Models (LLMs) such as "GPT-4", "GPT-4o", "Claude", "Gemini 1.5", and "LLaMA 3.1" for experiments, and discusses their "computational capacity". However, it does not provide specific details about the hardware (e.g., GPU models, CPU types, memory) used to run these experiments. |
| Software Dependencies | No | The paper mentions using "PDDL" to formalize instances and check validity for planning problems, and refers to various LLM architectures like "GPT-4", "Claude", "Gemini", and "LLaMA 3.1". However, it does not provide specific version numbers for any ancillary software dependencies (e.g., programming languages, libraries, frameworks, or solvers) used for implementing their methodology. |
| Experiment Setup | Yes | For Self-Refine (Madaan et al., 2024). We adhered to their original hyperparameters (Temperature = 0.7) but extended the maximum iterations from 4 to 10 to ensure fair comparison with our other baseline, Tree-Planner (Hu et al., 2023), which employs a maximum of 10 corrections (SF = 10). ... We maintained the structural hyperparameters from the original implementation [of Tree-Planner], with N = 25 initial samples and a maximum of 10 error corrections. |
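The baseline hyperparameters quoted in the Experiment Setup row can be collected into a single configuration for reference. This is a minimal illustrative sketch, not code from the paper; the dictionary structure and all key names are hypothetical, while the numeric values are taken directly from the quoted text.

```python
# Hypothetical config sketch of the reported baseline hyperparameters.
# Values come from the Experiment Setup quote; structure and names are ours.
baseline_config = {
    "self_refine": {
        "temperature": 0.7,    # original Self-Refine setting (Madaan et al., 2024)
        "max_iterations": 10,  # extended from 4 to 10 for fair comparison
    },
    "tree_planner": {
        "initial_samples": 25,       # N = 25 initial samples
        "max_error_corrections": 10, # maximum of 10 corrections
    },
}

for method, params in baseline_config.items():
    print(method, params)
```

Laying the settings out this way makes the stated fairness argument easy to check: both baselines are capped at the same number of correction/refinement rounds (10).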