Towards Explainable Goal Recognition Using Weight of Evidence (WoE): A Human-Centered Approach
Authors: Abeer Alshehri, Amal Abdulrahman, Hajar Alamri, Tim Miller, Mor Vered
JAIR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate the model computationally across eight GR benchmarks and through three user studies. The first study assesses the efficiency of generating human-like explanations within the Sokoban game domain, the second examines perceived explainability in the same domain, and the third evaluates the model's effectiveness in aiding decision-making in illegal fishing detection. Results demonstrate that the XGR model significantly enhances user understanding, trust, and decision-making compared to baseline models, underscoring its potential to improve human-agent collaboration. |
| Researcher Affiliation | Academia | Abeer Alshehri (School of Computing and Information Systems, The University of Melbourne, Melbourne, Australia; Department of Computer Science and Information Systems, King Khalid University, Abha, Saudi Arabia); Amal Abdulrahman (School of Computing, Macquarie University, Sydney, Australia); Hajar Alamri (Department of Computer Science and Information Systems, King Khalid University, Abha, Saudi Arabia); Tim Miller (School of Electrical Engineering and Computer Science, The University of Queensland, Brisbane, Australia); Mor Vered (School of Computing and Information Systems, Monash University, Melbourne, Australia) |
| Pseudocode | Yes | Algorithm 1: Explanation Generation Algorithm. Input: Oi, oi, Gp, Gc, and posterior probability over G. Output: explanation list Ω for all pairs (Gp, Gc). 1: Ω ← [] {initialize explanation list} 2: for oi ∈ O do 3: for g ∈ Gp do 4: for g′ ∈ Gc do 5: ωi ← woe(g/g′ : oi \| Oi) {compute Weight of Evidence (WoE)} 6: Ω ← Ω ∪ {⟨(g, g′), ωi, oi⟩} {add explanation to list} 7: end for 8: end for 9: end for 10: return Ω |
| Open Source Code | No | All data will be made available upon request. |
| Open Datasets | Yes | We evaluate the computational cost of the XGR model over eight online GR benchmark domains (Vered et al., 2018). We obtained the dataset from Penney et al. (2021), which was collected from professional StarCraft tournaments available as videos on demand from 2016 and 2017. |
| Dataset Splits | No | The paper mentions collecting data for user studies and identifying instances for analysis (e.g., "a total of 132 instances out of the six samples" for StarCraft), but it does not specify explicit training/test/validation dataset splits with percentages, sample counts, or methodology for machine learning model evaluation. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, memory, or detailed computer specifications used for running its experiments. |
| Software Dependencies | No | We used a STRIPS-like discrete planner to generate plan hypotheses derived from the domain theory and observations as our ground truth. |
| Experiment Setup | No | The paper describes the setup for human studies and computational evaluations, including the number of scenarios and participants, but does not provide specific model hyperparameters or training configurations for the XGR model (e.g., learning rate, batch size, number of epochs, optimizer settings) in the main text. |
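The triple loop of the Algorithm 1 pseudocode reported above can be sketched in Python. This is a minimal illustration, not the authors' implementation: the `likelihood` callable, the use of Good's log-likelihood-ratio form of WoE, and all function and variable names here are assumptions.

```python
import math
from itertools import product


def woe(p_obs_given_g: float, p_obs_given_g_prime: float) -> float:
    """Weight of evidence for goal g over contrast goal g' given one
    observation: the log-likelihood ratio log[P(o | g) / P(o | g')]
    (Good's WoE form, assumed here; the paper additionally conditions
    on the prior observations Oi)."""
    return math.log(p_obs_given_g / p_obs_given_g_prime)


def generate_explanations(observations, promoted_goals, contrast_goals, likelihood):
    """Enumerate WoE explanations for every (observation, g, g') triple,
    mirroring Algorithm 1's nested loops. `likelihood(o, g)` is a
    hypothetical callable returning P(o | g)."""
    explanations = []  # Ω, the explanation list
    for o_i, g, g_prime in product(observations, promoted_goals, contrast_goals):
        omega_i = woe(likelihood(o_i, g), likelihood(o_i, g_prime))
        explanations.append(((g, g_prime), omega_i, o_i))  # ⟨(g, g′), ωi, oi⟩
    return explanations
```

For example, with a single observation that is four times as likely under goal A as under goal B, the sketch yields one explanation triple whose WoE is log 4 ≈ 1.386, i.e., positive evidence favoring goal A.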