Improving Large Language Model Planning with Action Sequence Similarity
Authors: Xinran Zhao, Hanie Sedghi, Bernd Bohnet, Dale Schuurmans, Azade Nova
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through extensive experiments, we observe that commonly used problem similarity may result in false positives with drastically different plans, which can mislead the model. In response, we propose to sample and filter exemplars leveraging plan-side action sequence similarity (AS). Our experimental results confirm that GRASE-DC achieves significant performance improvement on various planning tasks (up to ~11-40 point absolute accuracy improvement with 27.3% fewer exemplars needed on average). Extensive analysis validates the consistent performance improvement of GRASE-DC with various backbone LLMs and on both classical planning and natural language planning benchmarks. |
| Researcher Affiliation | Collaboration | Xinran Zhao1,2, Hanie Sedghi1, Bernd Bohnet1, Dale Schuurmans1, Azade Nova1 1Google DeepMind, 2Carnegie Mellon University. Work done as a student researcher at Google DeepMind. Correspondence to EMAIL. |
| Pseudocode | No | The paper includes Figure 1 which is an illustration of the GRASE-DC pipeline, but it is a diagram, not a structured pseudocode or algorithm block. The methodology is described in prose within sections 2.2.1 and 2.2.2. |
| Open Source Code | No | The paper does not contain any explicit statements about code availability, links to code repositories, or mention of code in supplementary materials for the methodology described. |
| Open Datasets | Yes | Dataset and LLM Backbone. Similar to existing work (Valmeekam et al., 2023a; Bohnet et al., 2024), we conduct our main experiments on data collected by or created from the pipeline of (Aeronautiques et al., 1998; Höller et al., 2020). Specifically, we conduct experiments on four PDDL tasks: Blocks World, Minigrid, Logistics, and Tetris, with details in Appendix A.1. For natural language planning, we conduct experiments on Trip Planning (Zheng et al., 2024). |
| Dataset Splits | Yes | We use 300 test examples for each task and the originally provided training set as our exemplar candidates. For each test example, there is an Oracle test plan given, which is a valid plan and satisfies the goal in the task description, but it is not necessarily the only viable plan. |
| Hardware Specification | Yes | We conduct our experiments on a machine with 8 Nvidia A6000 (40G) GPUs with CUDA 12 installed with inference structure built upon vLLM (Kwon et al., 2023). |
| Software Dependencies | Yes | We conduct our experiments on a machine with 8 Nvidia A6000 (40G) GPUs with CUDA 12 installed with inference structure built upon vLLM (Kwon et al., 2023). |
| Experiment Setup | Yes | If the use of backbone LLM is not specified, we use Gemini 1.5 Pro (Gemini Team et al., 2024) as the default to generate plans at test time. We also experiment with other commercial and open-source LLMs, including GPT-4-Turbo (Achiam et al., 2023), Claude-3.0-Opus (Anthropic, 2024), and Llama 3.1 (Dubey et al., 2024) with different parameter sizes (results are in Section 3.3). If applicable, we set the max output token to be 1,600. For the MLP in the main paper, we initialize the network with 2 hidden layers with 400 neurons per layer. We use Adam as the optimizer with a learning rate of 1e-5. The whole training process is not sensitive to hyperparameter settings in our experiments. We train our MLP for Blocksworld with 400 exemplar candidates, which leads to 400 x 400 pairs as data points. We use the simplest prompt to clearly present the effect of exemplars in ICL. In detail, our prompt is ( Please solve the problem:{task}; Your plan as plain text without formatting:{plan}; done. ). |
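
The core idea in the abstract, ranking exemplars by the similarity of their action sequences rather than their problem descriptions, can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes a plan is a list of action strings and uses `difflib.SequenceMatcher`'s normalized matching ratio as the sequence-similarity measure (the paper's exact AS metric and the GRASE-DC sampling/filtering pipeline may differ), with hypothetical helper names `action_sequence_similarity` and `rank_exemplars`:

```python
import difflib

def action_sequence_similarity(plan_a, plan_b):
    """Similarity between two plans' action sequences.

    Assumption: each plan is a list of action strings; similarity is
    the normalized matching ratio over the two sequences (1.0 means
    identical action sequences, 0.0 means no actions in common).
    """
    return difflib.SequenceMatcher(None, plan_a, plan_b).ratio()

def rank_exemplars(guessed_plan, candidates, top_k=3):
    """Rank exemplar candidates by action-sequence similarity.

    `guessed_plan` stands in for a draft plan for the test problem;
    each candidate is a dict with a "plan" key holding its action list.
    """
    scored = sorted(
        candidates,
        key=lambda c: action_sequence_similarity(guessed_plan, c["plan"]),
        reverse=True,
    )
    return scored[:top_k]
```

Two problems with near-identical descriptions can still have very different action sequences, which is why ranking on the plan side avoids the false positives the abstract mentions.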
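
The prompt template quoted in the Experiment Setup row can be assembled into an ICL prompt roughly as below. The per-example template is taken from the paper's quote; how exemplars are concatenated and how the test example is terminated are assumptions of this sketch, and `build_prompt` is a hypothetical helper name:

```python
def build_prompt(exemplars, test_task):
    """Assemble an in-context-learning prompt from the paper's template:
    'Please solve the problem:{task}; Your plan as plain text without
    formatting:{plan}; done.'

    Assumption: exemplars are joined with newlines, and the test example
    stops after the plan cue so the LLM completes the plan itself.
    """
    parts = []
    for ex in exemplars:
        parts.append(
            f"Please solve the problem:{ex['task']}; "
            f"Your plan as plain text without formatting:{ex['plan']}; done."
        )
    parts.append(
        f"Please solve the problem:{test_task}; "
        "Your plan as plain text without formatting:"
    )
    return "\n".join(parts)
```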