ACPBench: Reasoning About Action, Change, and Planning
Authors: Harsha Kokel, Michael Katz, Kavitha Srinivas, Shirin Sohrabi
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our extensive evaluation of 21 LLMs and OpenAI o1 reasoning models highlights the significant gap in the reasoning capability of the LLMs. Our findings with OpenAI o1, a multi-turn reasoning model, reveal significant gains in performance on multiple-choice questions, yet surprisingly, no notable progress is made on boolean questions. We evaluate performance of the OpenAI o1 reasoning model and 21 state-of-the-art language models... on the ACPBench. We conduct ablation studies as follows: (a) to understand effects of in-context examples and COT |
| Researcher Affiliation | Industry | Harsha Kokel, Michael Katz, Kavitha Srinivas, Shirin Sohrabi IBM Research EMAIL, EMAIL |
| Pseudocode | No | The paper describes the methodology and tasks in prose, and includes figures for performance and examples, but no structured pseudocode or algorithm blocks are explicitly presented or labeled. |
| Open Source Code | No | The paper provides a link to a dataset: 'Dataset https://ibm.github.io/ACPBench'. While this link is to a GitHub domain, it is explicitly labeled as the dataset for the benchmark, and there is no explicit statement or separate link provided for the source code of the methodology, such as the fine-tuning code or evaluation scripts used in the experiments. |
| Open Datasets | Yes | Dataset https://ibm.github.io/ACPBench Extended version https://doi.org/10.48550/arXiv.2410.05669 |
| Dataset Splits | Yes | We use 25 PDDL problem files of varying sizes per domain. These 25 tasks are partitioned into a training and a test set. ... But to keep the test set of reasonable size, we generate only 10 examples per domain, per task. |
| Hardware Specification | Yes | All LLMs were either accessed using API or hosted locally using the Hugging Face Transformers library on machines with 2 A100 80GB GPUs. ... We finetuned Granite-code 8B available on Hugging Face with two A100 80GB GPUs. |
| Software Dependencies | No | The paper mentions using the 'Hugging Face Transformers library', 'classical planners', and 'lifted mutex groups implementations from Fišer (2020)', but it does not specify version numbers for these software components or libraries, which is required for reproducibility. |
| Experiment Setup | Yes | We restrict the evaluation to single-turn COT prompting with two in-context examples. ... All our LLM experiments had a generated token limit of 1024; OpenAI o1 models did not have that limit. ... We compare four prompt styles: (1) IO prompt, (2) Chain-of-Thought prompt without in-context examples (COT), (3) IO prompt with two in-context examples (IO 2-shots), and (4) Chain-of-Thought with two in-context examples (COT 2-shots). |