Explaining Decisions of Agents in Mixed-Motive Games
Authors: Maayan Orner, Oleg Maksimov, Akiva Kleinerman, Charles Ortiz, Sarit Kraus
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This section presents the results of our evaluation experiments in Diplomacy and Risk. The experiments conducted in the COP game are summarized here and described in detail in the appendix. We conducted two complementary studies with humans in two different environments. |
| Researcher Affiliation | Academia | ¹Department of Computer Science, Bar-Ilan University, Israel; ²SRI International, USA |
| Pseudocode | Yes | Algorithm 1: Simulate |
| Open Source Code | No | The paper does not contain any explicit statements about releasing source code for the methodology described, nor does it provide a direct link to a code repository. |
| Open Datasets | No | The paper mentions using "no-press Diplomacy", "Communicate Out of Prison (COP) game", and a "simplified version of Risk". While Diplomacy is a known game, the specific 'game environment from (Paquette et al. 2019)' is cited as an external resource used, not a dataset created and shared by the authors. The COP game was designed by the authors, and the experimental data generated (e.g., "randomly generated 30 Diplomacy game states", "12 board states" for Risk, "simulated the game until it included some chat history") are not stated to be publicly available with access information. |
| Dataset Splits | No | The paper describes generating specific game states for human user studies (e.g., "randomly generated 30 Diplomacy game states", "generated 12 board states" for Risk). However, it does not provide specific train/test/validation dataset splits or methodologies typically used for model training and evaluation. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for running its experiments. |
| Software Dependencies | No | The paper mentions using GPT-4 for the COP game and neural policy networks for Diplomacy, but it does not specify versions for any programming languages, libraries (e.g., PyTorch, TensorFlow), or other software dependencies. |
| Experiment Setup | Yes | Explanation estimation: To explain action aᵢᵉ (e denotes "explained") given state s, the following steps are performed: 1. Simulate the next turn from s k times, where agent i performs action aᵢᵉ and all other agents follow their respective policies. 2. Estimate the utility values of each outcome using the value functions and rewards (Algorithm 1, line 13). ... We run k simulations from state sₜ, where agent i performs action aᵢᵉ and all other agents follow their respective policies. Then, we extract the most commonly used action of each agent accordingly. ... For the probable-actions-based explanations, we examined how the temperature parameter affected the game outcomes. We found that using a temperature τ = 0, which corresponds to greedy decoding (our approach), sometimes led to outcomes that were not probable when using τ = 0.7. |
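The simulation step quoted in the Experiment Setup row (run k rollouts where agent i plays the explained action, estimate outcome utilities, and extract each other agent's most common action) can be sketched as a small Monte Carlo routine. Everything below (the names `transition`, `value_fn`, the policy interface) is an illustrative assumption, not the paper's actual code:

```python
import random
from collections import Counter

def estimate_explanation(state, agent, explained_action,
                         policies, transition, value_fn, k=100, seed=0):
    """Sketch of the simulation-based explanation estimation step.

    Runs k one-turn simulations from `state` in which `agent` plays the
    explained action while every other agent samples from its own policy.
    Returns (i) the mean estimated utility of the resulting outcomes for
    `agent` and (ii) each other agent's most frequently sampled action.
    """
    rng = random.Random(seed)
    utilities = []
    counts = {a: Counter() for a in policies if a != agent}
    for _ in range(k):
        joint = {agent: explained_action}
        for other, policy in policies.items():
            if other == agent:
                continue
            action = policy(state, rng)   # sample the other agent's policy
            joint[other] = action
            counts[other][action] += 1
        next_state = transition(state, joint, rng)
        utilities.append(value_fn(next_state, agent))
    most_common = {a: c.most_common(1)[0][0] for a, c in counts.items()}
    return sum(utilities) / k, most_common

# Toy two-agent usage: agent "B" holds or attacks uniformly at random,
# and the state's value to "A" rises when "B" holds.
def policy_b(state, rng):
    return rng.choice(["attack", "hold"])

def transition(state, joint, rng):
    return state + (1 if joint["B"] == "hold" else -1)

def value_fn(state, agent):
    return state

mean_utility, probable = estimate_explanation(
    0, "A", "move", {"A": None, "B": policy_b}, transition, value_fn, k=200)
```

Here `probable["B"]` plays the role of the "most commonly used action" that the paper's probable-actions explanations are built from, and `mean_utility` plays the role of the utility estimate from Algorithm 1.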