Accurate Estimation of Feature Importance Faithfulness for Tree Models

Authors: Mateusz Gajewski, Adam Karczmarz, Mateusz Rapicki, Piotr Sankowski

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In our experiments, we evaluated different methods for calculating the Prediction Gap (PG). Our study encompassed the exact algorithm as well as two sampling-based integration techniques: Monte Carlo (MC) and Quasi-Monte Carlo (QMC). As the iteration count increased, the outputs of MC and QMC visibly converged to the output of our exact algorithm, confirming the good numerical stability of our approach. Specifically, the Normalized Mean Absolute Error (NMAE) for MC was 0.13 for single models and decreased to 0.01 for bigger models. In comparison, QMC exhibited NMAEs of approximately 0.05 for single models and 0.002 for bigger ones. These results underscore the efficacy of sampling methods for computing the PG, particularly for more sophisticated model structures.
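The MC/QMC comparison quoted above can be illustrated generically. The sketch below is not the authors' implementation: a toy function `f` stands in for a tree-model prediction, and the point `x`, the scale `sigma`, and the sample size are all illustrative assumptions. Both estimators approximate the same expectation under Gaussian perturbation; QMC replaces i.i.d. draws with scrambled Sobol points mapped through the Gaussian inverse CDF, which typically reduces estimation error at the same iteration count.

```python
import numpy as np
from scipy.stats import norm, qmc


def f(z):
    # Toy stand-in for a tree-model prediction (not the paper's models).
    return np.sin(z[:, 0]) + 0.5 * z[:, 1] ** 2


rng = np.random.default_rng(0)
x = np.array([0.3, -1.2])  # feature vector to perturb (illustrative)
sigma = 0.3                # one of the paper's values {0.1, 0.3, 1.0}
n = 4096                   # power of two, as Sobol sequences prefer

# Plain Monte Carlo: i.i.d. Gaussian perturbations.
eps_mc = rng.normal(0.0, sigma, size=(n, 2))
pg_mc = f(x + eps_mc).mean()

# Quasi-Monte Carlo: scrambled Sobol points in (0,1)^2,
# mapped to Gaussians via the inverse CDF.
sobol = qmc.Sobol(d=2, scramble=True, seed=0)
u = sobol.random(n)
eps_qmc = norm.ppf(u) * sigma
pg_qmc = f(x + eps_qmc).mean()

print(pg_mc, pg_qmc)
```

For this smooth toy integrand both estimates agree closely; the paper's NMAE comparison makes the analogous measurement against the exact closed-form value for real tree ensembles.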
Researcher Affiliation | Collaboration | 1 Faculty of Mathematics, Informatics and Mechanics, University of Warsaw, Warsaw, Poland; 2 IDEAS NCBR; 3 Faculty of Computing and Telecommunications, Poznan University of Technology, Poznan, Poland; 4 MIM Solutions
Pseudocode | Yes | Algorithm 1: Computing Π(u, v) for all leaf pairs u, v, given model T, a feature vector x, and important features S ⊆ [d]
Open Source Code | Yes | https://github.com/rapicki/prediction-gap
Open Datasets | Yes | 1. Red Wine Quality (Cortez et al. 2009). The dataset contains 11 wine features, all numerical and continuous. The task is to predict the quality score of a wine, an integer between 1 and 10; we treated it as a regression task. The dataset contains 1 599 examples. 2. California Housing (Torgo 2023). The dataset contains information from the 1990 Californian census. There are 8 numerical characteristics and one categorical feature, proximity to the ocean. For the reasons outlined before, we decided to drop this feature and use a modified dataset. The task is to predict the median house value. The dataset contains 20 640 examples and has 8 features. 3. Parkinson Telemonitoring Data (Tsanas and Little 2009). The dataset contains 5 875 voice measurements from Parkinson's disease patients, collected at home. It includes 17 numerical features, after dropping 3 categorical columns (ID, age), with the task of predicting UPDRS motor and total scores.
Dataset Splits | Yes | In each case, the dataset was split 80:20 into training and test sets.
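An 80:20 split of this kind is commonly done with scikit-learn; the sketch below uses synthetic placeholder data shaped like the Red Wine Quality dataset (1 599 rows, 11 features) rather than the real files, and the random seed is an arbitrary assumption.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic placeholder standing in for a real tabular dataset
# (same shape as Red Wine Quality: 1 599 examples, 11 features).
rng = np.random.default_rng(42)
X = rng.normal(size=(1599, 11))
y = rng.uniform(1, 10, size=1599)

# 80:20 train/test split, as described in the paper.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

print(len(X_train), len(X_test))
```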
Hardware Specification | Yes | The computations were carried out on a Format Server THOR E221 (Supermicro) server equipped with two AMD EPYC 7702 64-core processors and 512 GB of RAM, running Ubuntu 22.04.1 LTS.
Software Dependencies | No | The paper mentions software such as Python, C++, numpy.float32, and XGBoost, but does not specify version numbers.
Experiment Setup | Yes | Model type m: recall that, in our case, there were two model types for a fixed dataset. The standard deviation σ of the Gaussian used to perturb a feature: we used the values {0.1, 0.3, 1.0}. For each number of iterations i ∈ {100, 500, 1000, 2000, 4000, 6000, 8000, 10000, 15000, 20000, 25000, 30000, 35000}, we ran our closed-form algorithm and the sampling method in question with iteration count i, both estimating the same value PG²(x, S) over N = 20 000 random pairs (x, S).