Explaining the Behavior of Black-Box Prediction Algorithms with Causal Learning
Authors: Numair Sani, Daniel Malinsky, Ilya Shpitser
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present a simulation study that highlights key issues and demonstrates the strength of our approach. We apply a version of our proposal to two datasets: annotated image data for bird classification and annotated chest X-ray images for pneumonia detection. ... We conduct two real data experiments to demonstrate the utility of our approach. |
| Researcher Affiliation | Collaboration | Numair Sani, Sani Analytics, Mumbai, MH, India; Daniel Malinsky, Department of Biostatistics, Columbia University, New York, NY, USA; Ilya Shpitser, Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA. |
| Pseudocode | No | The paper describes methods and algorithms narratively but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper mentions using third-party tools like TETRAD freeware and implementations of LIME and SHAP, providing links to their repositories. However, it does not provide any concrete access information for the authors' own implementation code or methodology described in the paper. |
| Open Datasets | Yes | First, we study a neural network for bird classification, trained on the Caltech-UCSD 200-2011 image dataset (Wah et al., 2011). ... Second, we follow essentially the same procedure to explain the behavior of a pneumonia detection neural network, trained on a subset of the Chest X-ray8 dataset (Wang et al., 2017a). ... Both data sources are publicly available online. |
| Dataset Splits | Yes | This yields a dataset of 3538 images, which is then partitioned into training, validation, and testing datasets of 2489, 520, and 529 images respectively. ... Using the same architecture as for the previous experiment and reserving 55 images for testing, ResNet18 achieves an accuracy of 74.55%. |
| Hardware Specification | No | The paper mentions the use of ResNet18 architecture and training parameters, but it does not specify any particular hardware (e.g., GPU, CPU models, memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions using the TETRAD package, the sklearn library for logistic regression, and implementations of LIME and SHAP. However, it does not provide specific version numbers for any of these software dependencies. |
| Experiment Setup | Yes | The model is trained for 15 epochs with a batch size of 64 and using the SGD optimizer with a learning rate of 0.01 and a momentum of 0.09. Additionally, we schedule a learning rate decay with a step size of 7 and γ = 0.1. ... We run FCI on each replicate with independence test rejection threshold (a tuning parameter) set to α = .05 and α = .01 for the birds and X-ray datasets, respectively, with the knowledge constraint imposed that outcome Ŷ cannot cause any of the interpretable features. Here FCI is used with the χ2 independence test, and we limit the maximum conditioning set size to 4 for computational tractability in the birds dataset. |