Causal Explanations for Sequential Decision Making

Authors: Samer B. Nashed, Saaduddin Mahmud, Claudia V. Goldman, Shlomo Zilberstein

JAIR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirically, we performed a series of experiments to evaluate the practicality and effectiveness of the proposed system, focusing on real-world computational demands and on the validity and reliability of metrics for comparing approximate and exact causal methods. Finally, we present two user studies that reveal user preferences for certain types of explanations and demonstrate a strong preference for explanations generated by our framework over those from other state-of-the-art systems.
Researcher Affiliation | Academia | Samer B. Nashed, University of Massachusetts Amherst, USA; Saaduddin Mahmud, University of Massachusetts Amherst, USA; Claudia V. Goldman, Hebrew University, Israel; Shlomo Zilberstein, University of Massachusetts Amherst, USA
Pseudocode | Yes | Algorithm 1 (Determine Weak Causes) ... Algorithm 8 (Mean RESP)
Open Source Code | No | The paper does not provide an explicit statement or link to the authors' own source code for the described methodology.
Open Datasets | Yes | In our experiments, 60 states are sampled from the Lunar Lander MDP, from OpenAI Gym [16]... in four environments: Lunar Lander, Taxi, Blackjack, and a version of Highway Env (highway-fast-v0; Kinematics observation).
Dataset Splits | No | The paper describes sampling states for evaluation within environments ("60 states were sampled with replacement from each domain") rather than providing explicit training/validation/test splits for a machine learning dataset.
Hardware Specification | Yes | All of our experiments were conducted on a Dell XPS 13 9310 laptop with an 11th Gen Intel Core i7-1185G7 3.00 GHz processor and 16 GB of 4267 MHz LPDDR4x RAM.
Software Dependencies | No | The paper mentions using Stable Baselines3 for deep Q-learning but does not provide a version number for it or for any other key software library.
Experiment Setup | No | The paper states that policies were learned via deep Q-learning or value iteration and used a multi-layer perceptron, but it does not specify concrete hyperparameters such as learning rate, batch size, or number of training epochs.
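The sampling procedure quoted above ("60 states were sampled with replacement from each domain") can be sketched as follows. This is a minimal illustration, not the paper's code: the `sample_states` helper, the domain names, and the integer stand-ins for states are all assumptions made for the example.

```python
import random

def sample_states(reachable_states, n=60, seed=None):
    """Sample n evaluation states, with replacement, from one domain's
    collection of states (hypothetical helper; mirrors the per-domain
    sampling described in the report)."""
    rng = random.Random(seed)
    return [rng.choice(reachable_states) for _ in range(n)]

# Illustrative stand-ins for the four domains' state collections;
# real states would be Gym observations, not integers.
domains = {
    "LunarLander": list(range(1000)),
    "Taxi": list(range(500)),
    "Blackjack": list(range(290)),
    "HighwayFast": list(range(800)),
}

# One fixed-seed evaluation set of 60 states per domain.
eval_sets = {name: sample_states(states, n=60, seed=0)
             for name, states in domains.items()}
```

Because sampling is with replacement, an evaluation set may contain duplicate states, which is consistent with the quoted description and distinct from a train/validation/test split.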