Planning in the Dark: LLM-Symbolic Planning Pipeline Without Experts

Authors: Sukai Huang, Nir Lipovetzky, Trevor Cohn

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments test the following hypotheses: (H1) Semantic equivalence across different representations, as discussed by Weaver, holds true in our context. (H2) Ambiguity in natural language descriptions leads to multiple interpretations. (H3) Our pipeline produces multiple solvable candidate sets of action schemas and plans without expert intervention, providing users with a range of options. (H4) Our pipeline outperforms direct LLM planning approaches in plan quality, demonstrating the advantage of integrating LLMs with symbolic planning methods. See Appendix for other experiments outside the scope of these hypotheses.
Researcher Affiliation | Collaboration | Sukai Huang (1), Nir Lipovetzky (1), and Trevor Cohn (1,2*); (1) The University of Melbourne, (2) Google. EMAIL, EMAIL
Pseudocode | No | The paper describes its methodology in prose and through diagrams (Figure 3) but does not include explicit pseudocode or algorithm blocks.
Open Source Code | Yes | Code: https://github.com/Sino-Huang/Official-LLMSymbolic-Planning-without-Experts
Open Datasets | Yes | For training and calibration of the sentence encoder, we used domains from IPC and PDDLGym (Silver and Chitnis 2020).
Dataset Splits | No | The paper mentions 'test domains' and 'training and calibration of the sentence encoder' but does not provide specific percentages, sample counts, or detailed methodology for dataset splits within those domains.
Hardware Specification | No | The paper acknowledges support from 'The University of Melbourne's Research Computing Services and the Petascale Campus Initiative' but does not specify exact GPU/CPU models, processor types, or memory amounts used for experiments.
Software Dependencies | No | The paper mentions specific LLM models (GLM), sentence encoders (text-embedding-3-large, sentence-t5-xl, all-roberta-large-v1), and a symbolic planner (DUAL-BFWS) but does not provide version numbers for any software dependencies.
Experiment Setup | Yes | To ensure we explore a wide range of interpretations and effectively cover the user's intent, we utilize multiple LLM instances, denoted as {P_LLM^1, P_LLM^2, ..., P_LLM^N}, and set their temperature hyperparameter high to encourage diverse outputs.
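The sampling setup described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `sample_candidate_schema` is a hypothetical stand-in for one high-temperature LLM call (a real pipeline would query an LLM API N times), and the candidate strings are invented placeholders. The point it shows is the pattern of querying N independent instances and deduplicating their outputs into a candidate set.

```python
import random

# Hypothetical stand-in for one high-temperature LLM query. A real pipeline
# would call an LLM API here; randomness simulates temperature-driven diversity.
def sample_candidate_schema(rng: random.Random) -> str:
    preconditions = rng.sample(["(clear ?x)", "(on-table ?x)", "(holding ?x)"], k=2)
    return "(:action pick-up :precondition (and %s))" % " ".join(sorted(preconditions))

def collect_candidates(n_instances: int, seed: int = 0) -> list:
    """Query n_instances independent 'LLM instances' and deduplicate the
    sampled action-schema candidates, preserving first-seen order."""
    rng = random.Random(seed)
    seen = set()
    candidates = []
    for _ in range(n_instances):
        schema = sample_candidate_schema(rng)
        if schema not in seen:
            seen.add(schema)
            candidates.append(schema)
    return candidates

if __name__ == "__main__":
    for schema in collect_candidates(n_instances=10):
        print(schema)
```

Each surviving candidate would then be passed to the symbolic planner for solvability checking, which is how the pipeline yields multiple solvable candidate sets without expert intervention.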