Efficient and Accurate Explanation Estimation with Distribution Compression

Authors: Hubert Baniecki, Giuseppe Casalicchio, Bernd Bischl, Przemyslaw Biecek

ICLR 2025

For each reproducibility variable below, the assessed result is followed by the LLM response quoting supporting evidence from the paper.
Research Type: Experimental

"In experiments, we empirically validate that the CTE paradigm improves explanation estimation across 4 methods, 2 model classes, and over 50 datasets. We compare CTE to the widely adopted practice of i.i.d. sampling (see Appendix A for further motivation). We also report sanity-check results for a more deterministic baseline: sampling with k-medoids, where centroids from the clustering define a coreset from the dataset. We measure the accuracy and effectiveness of explanation estimation with respect to a ground-truth explanation (cf. Appendix I) that is estimated using a full validation dataset X, i.e. without sampling or compression."
Researcher Affiliation: Academia

Hubert Baniecki (University of Warsaw); Giuseppe Casalicchio (LMU Munich, Munich Center for Machine Learning); Bernd Bischl (LMU Munich, Munich Center for Machine Learning); Przemyslaw Biecek (University of Warsaw, Warsaw University of Technology)
Pseudocode: Yes

Listing 1: Code snippet showing the 3-line plug-in of distribution compression for SAGE estimation.
Listing 2: Code snippet showing the 3-line plug-in of distribution compression for SHAP estimation.
Listing 3: Code snippet showing the plug-in of distribution compression for EXPECTED-GRADIENTS.
Listing 4: Code snippet showing the plug-in of distribution compression for FEATURE-EFFECTS.
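The listings themselves are not reproduced in this report. The plug-in pattern they describe — swap the i.i.d. background sample for a compressed coreset before calling the explainer — can be sketched in plain Python. The greedy kernel herding below is a simplified, illustrative stand-in for the paper's COMPRESS++ routine (which lives in the goodpoints package); all function names and the kernel bandwidth here are our own choices, not the paper's code:

```python
import math
import random

def gaussian_k(x, y, sigma=0.25):
    """Gaussian kernel on scalars; sigma is an illustrative choice."""
    return math.exp(-((x - y) ** 2) / (2 * sigma ** 2))

def herd_coreset(data, m, sigma=0.25):
    """Greedy kernel herding: a simplified stand-in for COMPRESS++.

    Selects m distinct points whose kernel mean embedding tracks that of
    the full sample, so averages over the coreset approximate averages
    over `data` at a fraction of the cost.
    """
    n = len(data)
    # Average similarity of each candidate to the full sample.
    mean_embed = [sum(gaussian_k(x, y, sigma) for y in data) / n for x in data]
    chosen, chosen_set = [], set()
    sim_to_chosen = [0.0] * n  # running sum of k(x_i, s) over chosen s
    for t in range(m):
        # Herding score: match the mean embedding, repel already-chosen points.
        best = max(
            (i for i in range(n) if i not in chosen_set),
            key=lambda i: mean_embed[i] - sim_to_chosen[i] / (t + 1),
        )
        chosen.append(best)
        chosen_set.add(best)
        for i in range(n):
            sim_to_chosen[i] += gaussian_k(data[i], data[best], sigma)
    return [data[i] for i in chosen]

# Explanation estimators like SHAP and SAGE average a model over a
# background set; the coreset replaces the full background.
random.seed(0)
background = [random.betavariate(2, 5) for _ in range(400)]
coreset = herd_coreset(background, m=20)
model = lambda x: x * x  # stand-in for an expensive model call
est_full = sum(map(model, background)) / len(background)
est_core = sum(map(model, coreset)) / len(coreset)
```

With 20 coreset points standing in for 400 background points, the averaged model output stays close to the full-data estimate, which is the effect the listings exploit inside each explainer.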
Open Source Code: Yes

"Code. We provide additional details on reproducibility in the Appendix; the code to reproduce all experiments in this paper is available at https://github.com/hbaniecki/compress-then-explain."
Open Datasets: Yes

"We use the preprocessed datasets and pretrained neural network models from the well-established OpenXAI benchmark (Agarwal et al., 2022). Further details on datasets and models are provided in Appendix D.2. Next, we aim to show the broader applicability of CTE by evaluating it on gradient-based explanations specific to neural networks, often fitted to larger unstructured datasets. We now study CTE together with EXPECTED-GRADIENTS of neural network models trained on 18 datasets (n_valid > 1000, d ≥ 32) from the OpenML-CC18 (Bischl et al., 2021) and OpenML-CTR23 (Fischer et al., 2023) benchmark suites. Details on datasets and models are provided in Appendix D.2."
Dataset Splits: Yes

"We first split all datasets in a 75:25 (train:validation) ratio, leaving 48 datasets with n_valid > 1000 for our experiments. For the 30 smaller (d < 32) datasets, we train an XGBoost model with default hyperparameters (200 estimators) and explain it with SHAP, SAGE, and FEATURE-EFFECTS. For the 18 bigger (d ≥ 32) datasets, we train a 3-layer neural network model with (128, 64) neurons in hidden ReLU layers and explain it with EXPECTED-GRADIENTS."
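The 75:25 split and the n_valid > 1000 filter are simple to reproduce. The paper does not publish its splitting code, so the index-based version below is only a generic sketch of the stated procedure:

```python
import random

def train_valid_split(n_rows, valid_ratio=0.25, seed=42):
    """Shuffle row indices and split them 75:25 into train/validation."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)
    n_valid = int(n_rows * valid_ratio)
    return idx[n_valid:], idx[:n_valid]  # (train indices, valid indices)

train_idx, valid_idx = train_valid_split(8000)
# A dataset is kept for the experiments only if its validation part
# exceeds 1000 rows (the paper's n_valid > 1000 filter).
keep = len(valid_idx) > 1000
```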
Hardware Specification: Yes

"We rely on popular open-source implementations of the algorithms (see Appendix C) and perform efficiency experiments on a personal computer with an M3 chip. Experiments described in Sections 4.1, 4.2 & 4.5, and Figure 4, were computed on a personal computer with an M3 chip, as justified at the beginning of Section 4. Experiments described in Sections 4.3 & 4.4 were computed on a cluster with 4 AMD Rome 7742 CPUs (256 cores) and 4 TB of RAM for about 14 days combined."
Software Dependencies: No

"We use the goodpoints Python package (Dwivedi & Mackey, 2021, MIT license). For SHAP, we use the KERNEL-SHAP and PERMUTATION-SHAP implementations from the shap Python package (Lundberg & Lee, 2017, MIT license) with default hyperparameters (notably, npermutations=10 in the latter). For SAGE, we use the KERNEL-SAGE and PERMUTATION-SAGE implementations from the sage Python package (Covert et al., 2020, MIT license). For EXPECTED-GRADIENTS, we aggregate with mean the integrated gradients explanations from the captum Python package (Kokhlikyan et al., 2020, BSD-3 license), for which we use default hyperparameters; notably, n_steps=50 and method="gausslegendre"."
Experiment Setup: Yes

"We use the default hyperparameters of explanation algorithms (details are provided in Appendix D.1). For distribution compression, we use COMPRESS++ implemented in the goodpoints Python package (Dwivedi & Mackey, 2021), where we follow (Shetty et al., 2022) to use a Gaussian kernel k with σ = 2d. For SHAP, we use the KERNEL-SHAP and PERMUTATION-SHAP implementations from the shap Python package (Lundberg & Lee, 2017, MIT license) with default hyperparameters (notably, npermutations=10 in the latter). For SAGE, we use the KERNEL-SAGE and PERMUTATION-SAGE implementations from the sage Python package (Covert et al., 2020, MIT license). We use default hyperparameters; notably, a cross-entropy loss for classification and mean squared error for regression. For EXPECTED-GRADIENTS, we aggregate with mean the integrated gradients explanations from the captum Python package (Kokhlikyan et al., 2020, BSD-3 license), for which we use default hyperparameters; notably, n_steps=50 and method="gausslegendre". For FEATURE-EFFECTS, we implement the partial dependence algorithm (Apley & Zhu, 2020; Moosbauer et al., 2021) ourselves for maximum computational speed in case of 2-dimensional plots, mimicking the popular open-source implementations. We use 100 uniformly distributed grid points for 1-dimensional plots and 10 × 10 uniformly distributed grid points for 2-dimensional plots. For the 30 smaller (d < 32) datasets, we train an XGBoost model with default hyperparameters (200 estimators) and explain it with SHAP, SAGE, and FEATURE-EFFECTS. For the 18 bigger (d ≥ 32) datasets, we train a 3-layer neural network model with (128, 64) neurons in hidden ReLU layers and explain it with EXPECTED-GRADIENTS."
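The authors implement partial dependence themselves. The standard 1-dimensional algorithm they mimic — clamp one feature at each grid value and average the model's predictions over the dataset — reduces to a few lines; this sketch is ours, not the paper's implementation:

```python
def partial_dependence(model, X, feature, grid):
    """1-D partial dependence: average prediction with `feature` fixed
    at each grid value, marginalizing over the other features in X."""
    curve = []
    for g in grid:
        preds = []
        for row in X:
            modified = list(row)
            modified[feature] = g  # clamp the feature of interest
            preds.append(model(modified))
        curve.append(sum(preds) / len(preds))
    return curve

# For an additive model, PD over feature 0 is the grid value shifted
# by the mean contribution of the remaining features.
X = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
curve = partial_dependence(sum, X, feature=0, grid=[0.0, 10.0])
```

The paper's version uses 100 grid points (and a 10 × 10 grid for 2-dimensional plots); under CTE, `X` would be the compressed coreset rather than the full validation set, which is what makes the inner double loop cheap.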