Goal Recognition Design for General Behavioral Agents using Machine Learning

Authors: Robert Kasumba, Guanghui Yu, Chien-Ju Ho, Sarah Keren, William Yeoh

TMLR 2025

Reproducibility assessment (variable, result, and LLM justification):
Research Type: Experimental. "Through extensive simulations, we demonstrate that our approach outperforms existing methods in reducing wcd and enhances runtime efficiency. Moreover, our approach also adapts to settings in which existing approaches do not apply, such as those involving flexible budget constraints, more complex environments, and suboptimal agent behavior. Finally, we conducted human-subject experiments that demonstrate that our method creates environments that facilitate efficient goal recognition from human decision-makers."
Researcher Affiliation: Academia. Robert Kasumba (Washington University in Saint Louis), Guanghui Yu (Washington University in Saint Louis), Chien-Ju Ho (Washington University in Saint Louis), Sarah Keren (Technion - Israel Institute of Technology), William Yeoh (Washington University in Saint Louis).
Pseudocode: No. The paper describes the optimization procedure as a "discrete gradient descent procedure" in Section 3.3 but does not present it as structured pseudocode or an algorithm block.
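Since the paper gives no algorithm block, the following is a hypothetical sketch of what a "discrete gradient descent" over environment modifications could look like: a greedy local search that flips one grid cell at a time, keeping the flip that most reduces a predicted objective, until a modification budget is spent or no flip helps. The `predicted_wcd` function and the grid encoding are illustrative stand-ins, not the paper's learned CNN predictor.

```python
from itertools import product

def predicted_wcd(env):
    # Stand-in for the paper's learned wcd predictor: a toy objective
    # that counts open cells (0) near the centre of the grid.
    n = len(env)
    return sum(1 - env[i][j] for i in range(n) for j in range(n)
               if abs(i - n // 2) + abs(j - n // 2) <= 2)

def discrete_descent(env, budget):
    """Greedily toggle one cell at a time (open <-> blocked),
    committing the flip that most reduces the predicted objective."""
    env = [row[:] for row in env]
    score = predicted_wcd(env)
    for _ in range(budget):
        best = None
        for i, j in product(range(len(env)), repeat=2):
            env[i][j] ^= 1                      # try flipping one cell
            s = predicted_wcd(env)
            env[i][j] ^= 1                      # undo the trial flip
            if s < score and (best is None or s < best[0]):
                best = (s, i, j)
        if best is None:                        # local minimum reached
            break
        score, i, j = best
        env[i][j] ^= 1                          # commit the best flip
    return env, score

grid = [[0] * 5 for _ in range(5)]              # all-open 5x5 grid
new_grid, final = discrete_descent(grid, budget=3)
print(final)
```

Each iteration evaluates every single-cell change and keeps the best one, which is one plausible discrete analogue of a gradient step; the paper's actual procedure may differ.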
Open Source Code: No. The paper contains no explicit statement about releasing the source code for its methodology and provides no link to a code repository for this work; it links only to the OpenReview forum and the Overcooked-AI environment, a third-party tool.
Open Datasets: No. "To build the predictive model for wcd, we curate a training dataset through simulations. For an environment w and agent behavioral model h, we can obtain wcd(w, h) by solving for the agent's actions towards different goals. After collecting a training dataset, we train the predictive model using a convolutional neural network. The implementation details are in Section 5.1.1 and the appendix." Experiment 1 (Collection of Human Behavioral Data) states: "The collected human data were split into training (160 workers, 70,000 user decisions), validation, and testing sets (20 workers, 8,800 decisions each)." The paper describes how the data were generated and collected but provides no concrete access information (link, DOI, or repository) for these datasets.
Dataset Splits: Yes. "The collected human data were split into training (160 workers, 70,000 user decisions), validation, and testing sets (20 workers, 8,800 decisions each)."
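The reported split is by worker (160 train / 20 validation / 20 test), which implies decisions are grouped by worker rather than shuffled individually. A minimal sketch of such a leakage-free split, with illustrative record and field names (the paper does not specify its splitting code):

```python
import random

def split_by_worker(decisions, n_train=160, n_val=20, n_test=20, seed=0):
    """Split decision records by worker ID so that no worker appears
    in more than one split (avoids leakage across splits)."""
    workers = sorted({d["worker"] for d in decisions})
    assert len(workers) == n_train + n_val + n_test
    random.Random(seed).shuffle(workers)
    groups = {
        "train": set(workers[:n_train]),
        "val": set(workers[n_train:n_train + n_val]),
        "test": set(workers[n_train + n_val:]),
    }
    return {name: [d for d in decisions if d["worker"] in ids]
            for name, ids in groups.items()}

# Toy data: 200 workers with a handful of decisions each.
data = [{"worker": w, "action": a} for w in range(200) for a in range(5)]
splits = split_by_worker(data)
print({k: len(v) for k, v in splits.items()})
```

Splitting on worker IDs rather than on individual decisions is the standard way to keep all of one participant's behavior inside a single split.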
Hardware Specification: Yes. "All experiments were run on a computing cluster equipped with 40 CPU cores (Intel Xeon Gold 6148 @ 2.40GHz), a single NVIDIA Tesla V100 SXM2 GPU (32GB), and up to 80GB of memory."
Software Dependencies: No. The paper states: "Python 3.10 and widely used scientific libraries. PyTorch was our main deep learning framework, with NumPy and pandas handling numerical computation and data processing." While Python 3.10 is specified, no version numbers are given for PyTorch, NumPy, or pandas, which are key dependencies.
Experiment Setup: Yes. "We used Adam optimizer and MSE loss and tested learning rates of 0.1, 0.01, 0.001, and 0.0001. A learning rate of 0.001 consistently produced the lowest validation error... The best-performing configuration, CNN (100K, 0.001) combined with our gradient-based optimization, achieved the greatest reduction in wcd..."
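The sweep described above can be sketched as follows. This is a pure-Python stand-in: a one-parameter linear model trained with a hand-rolled Adam update and MSE loss in place of the authors' PyTorch CNN, so the learning rate it selects on this toy problem need not match the paper's 0.001. All data and names are illustrative.

```python
import math, random

def train_adam(xs, ys, lr, steps=200):
    """Fit y ~ w * x with MSE loss using Adam updates; a stand-in
    for the paper's PyTorch CNN training loop."""
    w, m, v = 0.0, 0.0, 0.0
    b1, b2, eps = 0.9, 0.999, 1e-8          # standard Adam defaults
    n = len(xs)
    for t in range(1, steps + 1):
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n
        m = b1 * m + (1 - b1) * grad         # first-moment estimate
        v = b2 * v + (1 - b2) * grad * grad  # second-moment estimate
        mhat, vhat = m / (1 - b1 ** t), v / (1 - b2 ** t)
        w -= lr * mhat / (math.sqrt(vhat) + eps)
    return w

def mse(w, xs, ys):
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

rng = random.Random(0)
xs = [rng.uniform(-1, 1) for _ in range(100)]
ys = [3.0 * x + 0.01 * rng.gauss(0, 1) for x in xs]
train_x, val_x = xs[20:], xs[:20]            # toy train/validation split
train_y, val_y = ys[20:], ys[:20]

errors = {lr: mse(train_adam(train_x, train_y, lr), val_x, val_y)
          for lr in (0.1, 0.01, 0.001, 0.0001)}  # grid from the paper
best_lr = min(errors, key=errors.get)        # pick lowest validation MSE
print(best_lr)
```

Selecting the learning rate by lowest validation error, as the quoted passage describes, is exactly this argmin over the candidate grid; only the model being trained differs here.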