Improving Large Language Model Planning with Action Sequence Similarity
Authors: Xinran Zhao, Hanie Sedghi, Bernd Bohnet, Dale Schuurmans, Azade Nova
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through extensive experiments, we observe that commonly used problem similarity may result in false positives with drastically different plans, which can mislead the model. In response, we propose to sample and filter exemplars leveraging plan-side action sequence similarity (AS). Our experimental results confirm that GRASE-DC achieves significant performance improvement on various planning tasks (up to ~11-40 point absolute accuracy improvement with 27.3% fewer exemplars needed on average). Extensive analysis validates the consistent performance improvement of GRASE-DC with various backbone LLMs and on both classical planning and natural language planning benchmarks. |
| Researcher Affiliation | Collaboration | Xinran Zhao1,2, Hanie Sedghi1, Bernd Bohnet1, Dale Schuurmans1, Azade Nova1 1Google DeepMind, 2Carnegie Mellon University. Work done as a student researcher at Google DeepMind. Correspondence to EMAIL. |
| Pseudocode | No | The paper includes Figure 1 which is an illustration of the GRASE-DC pipeline, but it is a diagram, not a structured pseudocode or algorithm block. The methodology is described in prose within sections 2.2.1 and 2.2.2. |
| Open Source Code | No | The paper does not contain any explicit statements about code availability, links to code repositories, or mention of code in supplementary materials for the methodology described. |
| Open Datasets | Yes | Dataset and LLM Backbone. Similar to existing work (Valmeekam et al., 2023a; Bohnet et al., 2024), we conduct our main experiments on data collected by or created from the pipeline of (Aeronautiques et al., 1998; Höller et al., 2020). Specifically, we conduct experiments on four PDDL tasks: Blocks World, Minigrid, Logistics, and Tetris, with details in Appendix A.1. For natural language planning, we conduct experiments on Trip Planning (Zheng et al., 2024). |
| Dataset Splits | Yes | We use 300 test examples for each task and the originally provided training set as our exemplar candidates. For each test example, there is an Oracle test plan given, which is a valid plan and satisfies the goal in the task description, but it is not necessarily the only viable plan. |
| Hardware Specification | Yes | We conduct our experiments on a machine with 8 Nvidia A6000 (40G) GPUs with CUDA 12 installed with inference structure built upon vLLM (Kwon et al., 2023). |
| Software Dependencies | Yes | We conduct our experiments on a machine with 8 Nvidia A6000 (40G) GPUs with CUDA 12 installed with inference structure built upon vLLM (Kwon et al., 2023). |
| Experiment Setup | Yes | If the use of backbone LLM is not specified, we use Gemini 1.5 Pro (Gemini Team et al., 2024) as the default to generate plans at test time. We also experiment with other commercial and open-source LLMs, including GPT-4-Turbo (Achiam et al., 2023), Claude-3.0-Opus (Anthropic, 2024), and Llama 3.1 (Dubey et al., 2024) with different parameter sizes (results are in Section 3.3). If applicable, we set the max output token to be 1,600. For the MLP in the main paper, we initialize the network with 2 hidden layers with 400 neurons per layer. We use Adam as the optimizer with a learning rate of 1e-5. The whole training process is not sensitive to hyperparameter settings in our experiments. We train our MLP for Blocksworld with 400 exemplar candidates, which leads to 400 x 400 pairs as data points. We use the simplest prompt to clearly present the effect of exemplars in ICL. In detail, our prompt is ( Please solve the problem:{task}; Your plan as plain text without formatting:{plan}; done. ). |
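
The core idea in the abstract, ranking exemplars by the similarity of their action sequences rather than their problem descriptions, can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes a plan is a list of action strings and uses `difflib.SequenceMatcher`'s normalized matching ratio as the sequence-similarity measure (the paper's exact AS metric and the GRASE-DC sampling/filtering pipeline may differ), with hypothetical helper names `action_sequence_similarity` and `rank_exemplars`:

```python
import difflib

def action_sequence_similarity(plan_a, plan_b):
    """Similarity between two plans' action sequences.

    Assumption: each plan is a list of action strings; similarity is
    the normalized matching ratio over the two sequences (1.0 means
    identical action sequences, 0.0 means no actions in common).
    """
    return difflib.SequenceMatcher(None, plan_a, plan_b).ratio()

def rank_exemplars(guessed_plan, candidates, top_k=3):
    """Rank exemplar candidates by action-sequence similarity.

    `guessed_plan` stands in for a draft plan for the test problem;
    each candidate is a dict with a "plan" key holding its action list.
    """
    scored = sorted(
        candidates,
        key=lambda c: action_sequence_similarity(guessed_plan, c["plan"]),
        reverse=True,
    )
    return scored[:top_k]
```

Two problems with near-identical descriptions can still have very different action sequences, which is why ranking on the plan side avoids the false positives the abstract mentions.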
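
The prompt template quoted in the Experiment Setup row can be assembled into an ICL prompt roughly as below. The per-example template is taken from the paper's quote; how exemplars are concatenated and how the test example is terminated are assumptions of this sketch, and `build_prompt` is a hypothetical helper name:

```python
def build_prompt(exemplars, test_task):
    """Assemble an in-context-learning prompt from the paper's template:
    'Please solve the problem:{task}; Your plan as plain text without
    formatting:{plan}; done.'

    Assumption: exemplars are joined with newlines, and the test example
    stops after the plan cue so the LLM completes the plan itself.
    """
    parts = []
    for ex in exemplars:
        parts.append(
            f"Please solve the problem:{ex['task']}; "
            f"Your plan as plain text without formatting:{ex['plan']}; done."
        )
    parts.append(
        f"Please solve the problem:{test_task}; "
        "Your plan as plain text without formatting:"
    )
    return "\n".join(parts)
```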