Evaluating Interpretable Methods via Geometric Alignment of Functional Distortions

Authors: Anna Hedström, Philine Lou Bommer, Thomas F Burns, Sebastian Lapuschkin, Wojciech Samek, Marina MC Höhne

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through extensive cross-domain benchmarks on natural language processing, vision, and tabular tasks, we provide first-of-its-kind insights into the comparative performance of various interpretable methods. This includes local linear approximators, global feature visualisation methods, large language models as post-hoc explainers, and sparse autoencoders. Our contributions are important to the interpretability and AI safety communities, offering a principled, unified approach for evaluation.
Researcher Affiliation | Academia | 1 Department of Electrical Engineering and Computer Science, Technical University of Berlin; 2 BIFOLD Berlin Institute for the Foundations of Learning and Data; 3 Institute for Computational and Experimental Research in Mathematics, Brown University; 4 Scientific Artificial Intelligence Center, Cornell University, USA; 5 Neural Coding and Brain Computing Unit, Okinawa Institute of Science and Technology; 6 Department of Artificial Intelligence, Fraunhofer Heinrich Hertz Institute; 7 Department of Computer Science, University of Potsdam; 8 UMI Lab, Leibniz Institute of Agricultural Engineering and Bioeconomy e.V. (ATB)
Pseudocode | Yes | Algorithm 1: GEF Evaluator
Open Source Code | Yes | https://github.com/annahedstroem/GEF
Open Datasets | Yes | Table 2: An overview of datasets and models, with references in Appendix A.4. A semicolon separates models used per dataset. Appendix A.4.1: We employ various models for vision, text, and tabular tasks in our experiments. See Table 2. For vision classification, we use ImageNet-1K for object recognition (Russakovsky et al., 2015) with ResNet18 (He et al., 2016); Pathology and Derma for medical image analysis with the proposed MedCNN architecture (Yang et al., 2023); and MNIST (LeCun et al., 2010) and fMNIST (Xiao et al., 2017) for digit and fashion recognition with LeNet (LeCun et al., 1998). For text classification, we use SMS Spam (Almeida et al., 2011) with a tiny, fine-tuned BERT model (Romero, 2024); IMDb (Maas et al., 2011) with Pythia (Alignment Research, 2024); and SST-2 (Socher et al., 2013) with a tiny, fine-tuned BERT model (VityaVitalich, 2023). For tabular classification, we use Adult (Becker & Kohavi, 1996) and COMPAS (ProPublica, 2016) with a 3-layer MLP; and Avila (Stefano et al., 2018) with a 2-layer MLP.
Dataset Splits | No | The paper mentions a 'test set' in Appendix A.2.1 ('where N is the number of samples in the test set, denoted X') and refers to standard benchmark datasets, but it does not provide split percentages, absolute sample counts per split, or explicit details on how the datasets were partitioned for the experiments.
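Since the paper does not report how its data was partitioned, a minimal sketch of the kind of explicit, seeded split that would make the partition reproducible is shown below. The fractions and seed are illustrative values, not ones taken from the paper.

```python
import numpy as np

def make_split(n, test_frac=0.2, seed=0):
    """Deterministic train/test index split over n samples.

    test_frac and seed are illustrative defaults; a reproducible report
    would state the actual values used.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)          # seeded shuffle of all indices
    n_test = int(round(n * test_frac))
    return idx[n_test:], idx[:n_test]  # train indices, test indices
```

Re-running with the same seed yields the identical partition, which is exactly the detail the report flags as missing.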
Hardware Specification | Yes | Appendix A.4.3 Hardware: The experiments were conducted using two hardware configurations: a cluster with four Tesla V100S-PCIE-32GB GPUs, each offering 32 GB of memory, and a DGX-2 system featuring eight NVIDIA A100-SXM4-40GB GPUs, each with 40 GB of memory. Both setups use NVIDIA driver version 535.161.07 and CUDA 12.2.
Software Dependencies | No | Appendix A.4.2 Tooling: Several libraries and open-source implementations enabled this work, including transformers (Wolf et al., 2020), OpenXAI (Agarwal et al., 2022b), Captum (Kokhlikyan et al., 2020), Zennit (Anders et al., 2021), SHAP (Lundberg & Lee, 2017), Activation-Maximization (Nguyen, 2020), and Horama (Fel et al., 2024). For metric implementation and meta-evaluation, we use the Quantus (Hedström et al., 2023b) and MetaQuantus (Hedström et al., 2023a) libraries, respectively. The paper lists several software libraries and frameworks but does not provide specific version numbers for most of them (e.g., transformers, OpenXAI, Captum, Zennit, SHAP, Quantus, MetaQuantus), except for CUDA 12.2 and NVIDIA driver version 535.161.07.
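A lightweight way to capture the missing version information is to snapshot installed package versions at run time via the standard library. The package names below are taken from the tooling list; whether each is importable under that distribution name in a given environment is an assumption.

```python
import importlib.metadata as md

def snapshot_versions(packages):
    """Return {package: installed version or None} for reproducibility logs.

    Package names are assumed distribution names (e.g. from the paper's
    tooling section); missing packages map to None rather than raising.
    """
    versions = {}
    for pkg in packages:
        try:
            versions[pkg] = md.version(pkg)
        except md.PackageNotFoundError:
            versions[pkg] = None
    return versions

# Example (names assumed from the tooling section):
# snapshot_versions(["transformers", "captum", "zennit", "shap", "quantus"])
```

Emitting this dictionary alongside experimental results would close the gap the report identifies.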
Experiment Setup | Yes | Section 5.2 Introducing GEF Evaluator: Unless stated otherwise, we use Euclidean distance for δ in the functional distortion calculations (Definition 1), and define ρ using Spearman rank correlation... For the experiments, we set Z = 5... In Appendix A.2, we provide further details on the implementation, including how to generate the perturbation path (line 2) and how to tune parameters (line 2). Section 6 Experiments: Optimization steps are set to 50, 100, and 250; otherwise, default values are used as provided in the respective publications... For local methods, two variants of Layer-wise Relevance Propagation (LRP), the ε-rule (LRP-ε) (Bach et al., 2015) with ε = 1e-6... SmoothGrad (SMG) (Smilkov et al., 2017) with 10 noisy samples and noise level 0.1·(x_max − x_min), Integrated Gradients (INT-G) (Sundararajan et al., 2017) with 10 iterations and zero baseline... Two Shapley-based algorithms (Lundberg & Lee, 2017) are included: Gradient SHAP (SHAP-G) with 10 samples... LLM prompts describe the model's classification task and prediction certainty before and after model perturbation (Section 5.1). The temperature is set to 0 for deterministic outputs... For comparability, global and local explanations are normalised by dividing the attribution map by the square root of its average second-moment estimate (Equation 21) (Binder et al., 2022), with further explanation preprocessing details provided in Appendix A.4.4. Appendix A.7 Ablation Study: For each hyperparameter, i.e., the number of perturbed models M, the length of the perturbation path Z, the number of summation steps T, and the number of samples K, we enumerated over values from 0 to 20, while fixing the others at a default value of 10.
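The setup above pins down three concrete ingredients: second-moment normalisation of attributions, Euclidean distance for the functional distortion δ, and Spearman rank correlation for ρ along a perturbation path of length Z = 5. A minimal sketch combining them is given below; this is an illustrative reading of those choices, not the authors' Algorithm 1, and the array shapes and function names are assumptions.

```python
import numpy as np
from scipy.stats import spearmanr

def normalise_attribution(a):
    """Divide the attribution map by the square root of its average
    second-moment estimate, as described in the setup (Equation 21)."""
    return a / np.sqrt(np.mean(a ** 2) + 1e-12)  # epsilon guards div-by-zero

def gef_style_score(model_outputs, explanations, Z=5):
    """Hypothetical GEF-style evaluation along a perturbation path.

    model_outputs: (Z+1, D) model outputs, step 0 being the unperturbed model.
    explanations:  (Z+1, F) explanations at the same perturbation steps.
    Returns the Spearman rank correlation (rho) between functional and
    explanation distortions, both measured with Euclidean distance (delta).
    """
    f_dist = [np.linalg.norm(model_outputs[z] - model_outputs[0])
              for z in range(1, Z + 1)]
    e_dist = [np.linalg.norm(normalise_attribution(explanations[z])
                             - normalise_attribution(explanations[0]))
              for z in range(1, Z + 1)]
    rho, _ = spearmanr(f_dist, e_dist)
    return rho
```

A high rho indicates that the explanation distorts in step with the model's function under perturbation, which is the alignment intuition the evaluator formalises.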