ACPBench: Reasoning About Action, Change, and Planning

Authors: Harsha Kokel, Michael Katz, Kavitha Srinivas, Shirin Sohrabi

AAAI 2025

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "Our extensive evaluation of 21 LLMs and OpenAI o1 reasoning models highlights the significant gap in the reasoning capability of the LLMs. Our findings with OpenAI o1, a multi-turn reasoning model, reveal significant gains in performance on multiple-choice questions, yet surprisingly, no notable progress is made on boolean questions. We evaluate performance of the OpenAI o1 reasoning model and 21 state-of-the-art language models... on the ACPBench. We conduct ablation studies as follows: (a) to understand effects of in-context examples and COT" |
| Researcher Affiliation | Industry | Harsha Kokel, Michael Katz, Kavitha Srinivas, Shirin Sohrabi — IBM Research (EMAIL, EMAIL) |
| Pseudocode | No | The paper describes the methodology and tasks in prose and includes figures for performance and examples, but presents no structured or labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper provides a link to a dataset: "Dataset https://ibm.github.io/ACPBench". While this link is on a GitHub domain, it is explicitly labeled as the benchmark dataset, and no separate statement or link is provided for the source code of the methodology, such as the fine-tuning code or the evaluation scripts used in the experiments. |
| Open Datasets | Yes | Dataset: https://ibm.github.io/ACPBench. Extended version: https://doi.org/10.48550/arXiv.2410.05669 |
| Dataset Splits | Yes | "We use 25 PDDL problem files of varying sizes per domain. These 25 tasks are partitioned into a training and a test set. ... But to keep the test set of reasonable size, we generate only 10 examples per domain, per task." |
| Hardware Specification | Yes | "All LLMs were either accessed using API or hosted locally using the Hugging Face Transformers library on machines with 2 A100 80 GB GPUs. ... We finetuned Granite-code 8B available on Hugging Face with two A100 80GB GPUs." |
| Software Dependencies | No | The paper mentions using the Hugging Face Transformers library, classical planners, and the lifted mutex groups implementation from Fišer (2020), but it does not specify version numbers for any of these software components, which reproducibility requires. |
| Experiment Setup | Yes | "We restrict the evaluation to single-turn COT prompting with two in-context examples. ... All our LLM experiments had a generated token limit of 1024; OpenAI o1 models did not have that limit. ... We compare four prompt styles: (1) IO prompt, (2) Chain-of-Thought prompt without in-context examples (COT), (3) IO prompt with two in-context examples (IO 2-shots), and (4) Chain-of-Thought with two in-context examples (COT 2-shots)." |
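The dataset-split procedure quoted above (25 PDDL problems per domain, partitioned into train and test) could be reproduced along these lines. This is a minimal sketch: the actual split ratio, seeding, and example-generation logic are assumptions, as the paper does not specify them.

```python
import random

def split_tasks(task_ids, test_size, seed=0):
    """Partition one domain's task ids into train/test sets.
    The paper partitions 25 PDDL problems per domain; the
    test_size and seed used here are illustrative assumptions."""
    rng = random.Random(seed)
    ids = list(task_ids)
    rng.shuffle(ids)
    return ids[test_size:], ids[:test_size]

# 25 problems per domain, as in the paper; 5 held out is an assumption.
train, test = split_tasks(range(25), test_size=5)
```

Per the paper, the test set is then kept small by generating only 10 examples per domain, per task.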
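The four prompt styles compared in the experiment setup (IO, COT, IO 2-shots, COT 2-shots) can be sketched as simple template builders. The template wording, the `COT_INSTRUCTION` text, and the example questions below are illustrative assumptions, not the authors' actual ACPBench prompts.

```python
# Sketch of the four prompt styles: IO, COT, IO 2-shots, COT 2-shots.
# Wording and formatting are assumptions, not the ACPBench templates.

COT_INSTRUCTION = "Think step by step, then give the final answer."

def build_prompt(question, style, examples=()):
    """Assemble a prompt in one of four styles:
    'io', 'cot', 'io_2shot', 'cot_2shot'."""
    parts = []
    if style in ("io_2shot", "cot_2shot"):
        for q, a in examples:          # the two in-context examples
            parts.append(f"Q: {q}\nA: {a}")
    if style in ("cot", "cot_2shot"):
        parts.append(COT_INSTRUCTION)  # chain-of-thought cue
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)

shots = [("Is the goal reached?", "No"),
         ("Is action a applicable?", "Yes")]
io_prompt = build_prompt("Is fact f true after action a?", "io")
cot_2shot = build_prompt("Is fact f true after action a?", "cot_2shot", shots)
```

With a locally hosted model, generation would then be capped at 1024 new tokens (e.g. `max_new_tokens=1024` with Hugging Face Transformers), matching the limit quoted above.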