ACPBench: Reasoning About Action, Change, and Planning

Authors: Harsha Kokel, Michael Katz, Kavitha Srinivas, Shirin Sohrabi

AAAI 2025

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "Our extensive evaluation of 21 LLMs and OpenAI o1 reasoning models highlights the significant gap in the reasoning capability of the LLMs. Our findings with OpenAI o1, a multi-turn reasoning model, reveal significant gains in performance on multiple-choice questions, yet surprisingly, no notable progress is made on boolean questions. We evaluate performance of the OpenAI o1 reasoning model and 21 state-of-the-art language models... on the ACPBench. We conduct ablation studies as follows: (a) to understand effects of in-context examples and COT" |
| Researcher Affiliation | Industry | Harsha Kokel, Michael Katz, Kavitha Srinivas, Shirin Sohrabi — IBM Research (EMAIL, EMAIL) |
| Pseudocode | No | The paper describes the methodology and tasks in prose and includes figures for performance and examples, but presents no structured or labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper provides a link to a dataset: "Dataset https://ibm.github.io/ACPBench". While this link is on a GitHub domain, it is explicitly labeled as the benchmark dataset, and no separate statement or link is provided for the source code of the methodology, such as the fine-tuning code or the evaluation scripts used in the experiments. |
| Open Datasets | Yes | Dataset: https://ibm.github.io/ACPBench. Extended version: https://doi.org/10.48550/arXiv.2410.05669 |
| Dataset Splits | Yes | "We use 25 PDDL problem files of varying sizes per domain. These 25 tasks are partitioned into a training and a test set. ... But to keep the test set of reasonable size, we generate only 10 examples per domain, per task." |
| Hardware Specification | Yes | "All LLMs were either accessed using API or hosted locally using the Hugging Face Transformers library on machines with 2 A100 80 GB GPUs. ... We finetuned Granite-code 8B available on Hugging Face with two A100 80GB GPUs." |
| Software Dependencies | No | The paper mentions using the Hugging Face Transformers library, classical planners, and the lifted mutex groups implementation from Fišer (2020), but it does not specify version numbers for any of these software components, which reproducibility requires. |
| Experiment Setup | Yes | "We restrict the evaluation to single-turn COT prompting with two in-context examples. ... All our LLM experiments had a generated token limit of 1024; OpenAI o1 models did not have that limit. ... We compare four prompt styles: (1) IO prompt, (2) Chain-of-Thought prompt without in-context examples (COT), (3) IO prompt with two in-context examples (IO 2-shots), and (4) Chain-of-Thought with two in-context examples (COT 2-shots)." |
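The dataset-split procedure quoted above (25 PDDL problems per domain, partitioned into train and test) could be reproduced along these lines. This is a minimal sketch: the actual split ratio, seeding, and example-generation logic are assumptions, as the paper does not specify them.

```python
import random

def split_tasks(task_ids, test_size, seed=0):
    """Partition one domain's task ids into train/test sets.
    The paper partitions 25 PDDL problems per domain; the
    test_size and seed used here are illustrative assumptions."""
    rng = random.Random(seed)
    ids = list(task_ids)
    rng.shuffle(ids)
    return ids[test_size:], ids[:test_size]

# 25 problems per domain, as in the paper; 5 held out is an assumption.
train, test = split_tasks(range(25), test_size=5)
```

Per the paper, the test set is then kept small by generating only 10 examples per domain, per task.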
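The four prompt styles compared in the experiment setup (IO, COT, IO 2-shots, COT 2-shots) can be sketched as simple template builders. The template wording, the `COT_INSTRUCTION` text, and the example questions below are illustrative assumptions, not the authors' actual ACPBench prompts.

```python
# Sketch of the four prompt styles: IO, COT, IO 2-shots, COT 2-shots.
# Wording and formatting are assumptions, not the ACPBench templates.

COT_INSTRUCTION = "Think step by step, then give the final answer."

def build_prompt(question, style, examples=()):
    """Assemble a prompt in one of four styles:
    'io', 'cot', 'io_2shot', 'cot_2shot'."""
    parts = []
    if style in ("io_2shot", "cot_2shot"):
        for q, a in examples:          # the two in-context examples
            parts.append(f"Q: {q}\nA: {a}")
    if style in ("cot", "cot_2shot"):
        parts.append(COT_INSTRUCTION)  # chain-of-thought cue
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)

shots = [("Is the goal reached?", "No"),
         ("Is action a applicable?", "Yes")]
io_prompt = build_prompt("Is fact f true after action a?", "io")
cot_2shot = build_prompt("Is fact f true after action a?", "cot_2shot", shots)
```

With a locally hosted model, generation would then be capped at 1024 new tokens (e.g. `max_new_tokens=1024` with Hugging Face Transformers), matching the limit quoted above.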