The Disagreement Problem in Explainable Machine Learning: A Practitioner’s Perspective
Authors: Satyapriya Krishna, Tessa Han, Alex Gu, Steven Wu, Shahin Jabbari, Himabindu Lakkaraju
TMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we formalize and study the disagreement problem in explainable machine learning. More specifically, we define the notion of disagreement between explanations, analyze how often such disagreements occur in practice, and how practitioners resolve these disagreements. We first conduct interviews with data scientists to understand what constitutes disagreement between explanations generated by different methods for the same model prediction, and introduce a novel quantitative framework to formalize this understanding. We then leverage this framework to carry out a rigorous empirical analysis with four real-world datasets, six state-of-the-art post hoc explanation methods, and six different predictive models, to measure the extent of disagreement between the explanations generated by various popular explanation methods. In addition, we carry out an online user study with data scientists to understand how they resolve the aforementioned disagreements. |
| Researcher Affiliation | Academia | Satyapriya Krishna (Harvard University); Tessa Han (Harvard University); Alex Gu (Massachusetts Institute of Technology); Steven Wu (Carnegie Mellon University); Shahin Jabbari (Drexel University); Himabindu Lakkaraju (Harvard University) |
| Pseudocode | No | The paper formally defines several metrics for measuring disagreement (Section A), but these are mathematical formulations rather than structured pseudocode or algorithm blocks describing a computational procedure. |
| Open Source Code | Yes | The full set of figures showing explanation disagreement for the two datasets, over four models, measured by six metrics, displayed in two formats (metric mean in heatmap and metric distribution in boxplot) for varying values of top-k features can be found in the code repository accompanying this paper. |
| Open Datasets | Yes | For tabular data, we use the Correctional Offender Management Profiling for Alternative Sanctions (COMPAS) (ProPublica) and German Credit datasets (Repository). The COMPAS dataset comprises seven features on the demographics, criminal history, and prison time of 4,937 defendants. For text data, we use Antonio Gulli's (AG) corpus of news articles (AG_News) (ag, 2005). It contains 127,600 sentences (collected from 1,000,000+ articles from 2,000+ sources with a vocabulary size of 95,000+ words). For image data, we use the ImageNet 1k (Russakovsky et al., 2015; ima, 2015) object recognition dataset. It contains 1,381,167 images belonging to 1,000 categories. We obtain segmentation maps from PASCAL VOC 2012 (voc, 2012) that are directly used as super-pixels for the explanation methods. |
| Dataset Splits | Yes | For the COMPAS dataset, we train the four models based on an 80%-20% train-test split of the dataset, using features to predict the COMPAS risk score group. For the German credit dataset, we train the same four models based on an 80%-20% train-test split of the dataset, using features to predict the credit risk group. For text data, we trained a widely-used LSTM-based text classifier, based on 120,000 training samples and 7,600 test samples, to predict the news category of the article from which a sentence was obtained. |
| Hardware Specification | No | The paper mentions using pre-trained models like ResNet-18 and training an LSTM-based classifier, but it does not specify any particular CPU, GPU, or other hardware used for running their experiments or training their models. |
| Software Dependencies | No | The paper mentions using several explanation methods (LIME, Kernel SHAP, Gradient-based methods) and models (LSTM, ResNet-18) and provides links to their original papers, but it does not specify the version numbers of any software libraries, frameworks (e.g., PyTorch, TensorFlow), or programming languages used. |
| Experiment Setup | Yes | For tabular data, we train four models: logistic regression, densely connected feed-forward neural network, random forest, and gradient-boosted tree. For text data, we train a widely-used vanilla LSTM-based text classifier on the AG_News (Zhang et al., 2015) corpus. For image data, we use the pre-trained ResNet-18 (He et al., 2016) for ImageNet. For explanation methods with a sample size hyper-parameter, we either run the explanation method to convergence (i.e., select a sample size such that an increase in the number of samples does not significantly change the explanations) or use a sample size that is much higher than the sample size recommended by previous work. For both COMPAS and German Credit datasets, we used 2,000 samples for relevant methods (i.e., LIME, Kernel SHAP, SmoothGrad, and Integrated Gradients). Integrated Gradients explanations were generated using 500 steps... SmoothGrad explanations were generated using 500 samples... For LIME and Kernel SHAP, we chose 100 perturbations to train the surrogate model. |
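The paper's disagreement metrics are defined formally in its Section A rather than as pseudocode. As a rough illustration of the kind of quantity involved, the sketch below computes top-k feature agreement between two attribution vectors (the fraction of features appearing in the top-k of both explanations by absolute magnitude). This is a minimal sketch, not the paper's exact formulation; the attribution values and the function names are hypothetical.

```python
# Illustrative sketch (not the paper's exact metric): top-k feature agreement
# between two post hoc explanations for the same model prediction.

def top_k_features(attributions, k):
    """Indices of the k features with the largest absolute attribution."""
    order = sorted(range(len(attributions)),
                   key=lambda i: abs(attributions[i]), reverse=True)
    return set(order[:k])

def feature_agreement(expl_a, expl_b, k):
    """Fraction of overlap between the top-k feature sets of two explanations."""
    return len(top_k_features(expl_a, k) & top_k_features(expl_b, k)) / k

# Hypothetical attributions for one prediction (e.g., LIME vs. Kernel SHAP)
lime_attr = [0.42, -0.31, 0.05, 0.27, -0.02, 0.11, 0.01]
shap_attr = [0.38, 0.04, -0.29, 0.25, -0.01, 0.09, 0.02]
print(feature_agreement(lime_attr, shap_attr, k=3))  # prints 0.666... (2 of 3 shared)
```

A value near 1 indicates the two methods largely agree on which features matter; values near 0 indicate the kind of disagreement the paper measures across methods, models, and datasets.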
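The tabular setup reported above (an 80%-20% train-test split, with logistic regression as one of the four model families) can be sketched as follows. The data here is synthetic stand-in data with the COMPAS dimensions (4,937 rows, 7 features); the paper does not state which libraries it used, so scikit-learn is an assumption of this sketch.

```python
# Minimal sketch of the described tabular setup: 80%-20% split plus one of the
# four model families (logistic regression). Data is synthetic, not COMPAS.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(4937, 7))           # COMPAS scale: 4,937 rows, 7 features
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # synthetic binary risk label

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)  # the 80%-20% split from the paper
model = LogisticRegression().fit(X_train, y_train)
print(f"test accuracy: {model.score(X_test, y_test):.3f}")
```

The same split would then be reused across all four model families so that the explanation methods are compared on identical test instances.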