Contrastive Explanations of Plans through Model Restrictions

Authors: Benjamin Krarup, Senka Krivic, Daniele Magazzeni, Derek Long, Michael Cashmore, David E. Smith

JAIR 2021

Reproducibility Variable Result LLM Response
Research Type: Experimental. Our evaluation falls into two parts: we evaluate the performance of the compilation of constraints by examining the planning time and plan quality produced for a large sample of problems, and we also present the user study that explores the value of the iterative process of plan explanation. The latter evaluation is based on observed interactions with an implemented system and is, therefore, more qualitative in style than the former evaluation. Nevertheless, both evaluations together serve to support our claims that the approach we have described provides a paradigm that allows users to usefully explore explanations of plans, by asking contrastive questions and being supplied plans in response to the constraints implied by those questions. ... We conducted a study with 20 volunteers (5 students, 4 engineers, 4 software developers, 3 researchers, 2 assistant professors, a chemist and a copywriter) divided into two groups of 10 persons each. Participants' ages ranged from 23 to 43 years, with 35% identifying as female and 65% identifying as male. The average time G1 spent with the plan and the framework is 24.2 minutes, and on average, they asked 5.1 questions. The average time G2 spent with the plan and the framework is 21.1 minutes, and on average, they asked 3.7 questions, as can be seen in Figure 30. The maximum number of questions asked was 10, while the minimum was 1.
Researcher Affiliation: Collaboration. Benjamin Krarup (EMAIL), Senka Krivic (EMAIL), Daniele Magazzeni (EMAIL), Derek Long (EMAIL): King's College London, Bush House, WC2B 4BG, London, UK; Michael Cashmore (EMAIL): University of Strathclyde, Livingstone Tower, G1 1XH, Glasgow, UK; David E. Smith (EMAIL): PS Research, 25960 Quail Ln, Los Altos Hills, CA 94022, USA.
Pseudocode: No. The paper formally defines planning models and compilations using mathematical notation and PDDL2.1 syntax fragments (e.g., Figures 9, 15, 17), but it does not include a dedicated section or block explicitly labeled as "Pseudocode" or "Algorithm" for the overall methodology.
Open Source Code: Yes. All source code and example domain and problem files are open source and available online: https://github.com/KCL-Planning/XAIPFramework.
Open Datasets: Yes. We used four temporal domains from the recent ICAPS international planning competitions (IPC) (Long & Fox, 2003) in our experiments. The IPC produces a new set of benchmark domains each year to test the capabilities and progress made by AI planners for different types of problems. We selected domains to be varied in what they modelled and the most interesting in terms of explainability. These are the Zeno Travel, Depots (IPC3), Crew Planning and Elevators (IPC8) domains.
Dataset Splits: No. The paper uses problems from established IPC benchmarks, selecting specific problems (e.g., "problem 10 for the Depots domain", "problems 1 to 10 for the Crew Planning domain"). It does not describe splitting a single dataset into training, validation, or test sets with percentages or sample counts, which is the typical meaning of dataset splits.
Hardware Specification: Yes. All tests used a Core i7 1.9 GHz machine and 16 GB of memory.
Software Dependencies: No. The paper mentions several software components like PDDL2.1, POPF, Metric-FF, OPTIC, and VAL, along with Qt-Designer. However, it does not provide specific version numbers for these software packages or for its own framework's dependencies, which is necessary for reproducible software dependency information.
Experiment Setup: Yes. We found that for each of the problems 3 minutes planning time was sufficient, other than problem 10 for the Depots domain, which required 6 minutes. ... All tests used a Core i7 1.9 GHz machine, limited to five minutes and 16 GB of memory. We increased the planning time from the first experiment by two minutes to offer a larger window through which to view any growth trends in the planning time for models with iterated constraints. ... To ensure that the questions made sense, we had to take slightly different approaches to generating each question type. For each formal question other than FQ1 and FQ3, the actions were randomly selected from the original plan found from the appropriate model. ... The lower bound was generated using a pseudo-random number generator, constrained to within the original plan time. The upper bound was formed by first generating a number between 1.5 and 4 and then multiplying the number by the duration of the selected action.
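The quoted bound-generation procedure can be sketched in a few lines. This is not the authors' code: the function name, signature, and use of Python's `random` module are assumptions; it simply follows the description literally (lower bound drawn uniformly within the original plan time, upper bound equal to the selected action's duration scaled by a factor in [1.5, 4]).

```python
import random

def generate_time_window(plan_makespan, action_duration, rng=None):
    """Hypothetical sketch of the paper's question-bound generation."""
    rng = rng or random.Random()
    # Lower bound: pseudo-random, constrained to within the original plan time.
    lower = rng.uniform(0.0, plan_makespan)
    # Upper bound: a factor between 1.5 and 4, multiplied by the duration
    # of the randomly selected action.
    upper = rng.uniform(1.5, 4.0) * action_duration
    return lower, upper
```

Note that, read literally, nothing guarantees the upper bound exceeds the lower bound; the paper does not state how (or whether) such cases were filtered.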