Accurate Estimation of Feature Importance Faithfulness for Tree Models

Authors: Mateusz Gajewski, Adam Karczmarz, Mateusz Rapicki, Piotr Sankowski

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In our experiments, we evaluated different methods for calculating the Prediction Gap (PG). Our study encompassed the exact algorithm as well as two sampling-based integration techniques: Monte Carlo (MC) and Quasi-Monte Carlo (QMC). As the iteration count increased, the outputs of MC and QMC visibly converged to the output of our exact algorithm, confirming the good numerical stability of our approach. Specifically, the Normalized Mean Absolute Error (NMAE) for MC was 0.13 for single models and decreased to 0.01 for bigger models. In comparison, QMC exhibited NMAEs of approximately 0.05 for single models and 0.002 for bigger ones. These results underscore the efficacy of sampling methods for computing the PG, particularly for more sophisticated model structures.
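The MC/QMC comparison quoted above can be illustrated generically. The sketch below is not the authors' implementation: a toy function `f` stands in for a tree-model prediction, and the point `x`, the scale `sigma`, and the sample size are all illustrative assumptions. Both estimators approximate the same expectation under Gaussian perturbation; QMC replaces i.i.d. draws with scrambled Sobol points mapped through the Gaussian inverse CDF, which typically reduces estimation error at the same iteration count.

```python
import numpy as np
from scipy.stats import norm, qmc


def f(z):
    # Toy stand-in for a tree-model prediction (not the paper's models).
    return np.sin(z[:, 0]) + 0.5 * z[:, 1] ** 2


rng = np.random.default_rng(0)
x = np.array([0.3, -1.2])  # feature vector to perturb (illustrative)
sigma = 0.3                # one of the paper's values {0.1, 0.3, 1.0}
n = 4096                   # power of two, as Sobol sequences prefer

# Plain Monte Carlo: i.i.d. Gaussian perturbations.
eps_mc = rng.normal(0.0, sigma, size=(n, 2))
pg_mc = f(x + eps_mc).mean()

# Quasi-Monte Carlo: scrambled Sobol points in (0,1)^2,
# mapped to Gaussians via the inverse CDF.
sobol = qmc.Sobol(d=2, scramble=True, seed=0)
u = sobol.random(n)
eps_qmc = norm.ppf(u) * sigma
pg_qmc = f(x + eps_qmc).mean()

print(pg_mc, pg_qmc)
```

For this smooth toy integrand both estimates agree closely; the paper's NMAE comparison makes the analogous measurement against the exact closed-form value for real tree ensembles.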
Researcher Affiliation | Collaboration | 1 Faculty of Mathematics, Informatics and Mechanics, University of Warsaw, Warsaw, Poland; 2 IDEAS NCBR; 3 Faculty of Computing and Telecommunications, Poznan University of Technology, Poznan, Poland; 4 MIM Solutions
Pseudocode | Yes | Algorithm 1: Computing Π(u, v) for all leaf pairs u, v, given model T, a feature vector x, and important features S ⊆ [d]
Open Source Code | Yes | https://github.com/rapicki/prediction-gap
Open Datasets | Yes | 1. Red Wine Quality (Cortez et al. 2009). The dataset contains 11 wine features, all numerical and continuous. The task is to predict the quality score of a wine, an integer between 1 and 10; we treated it as a regression task. The dataset contains 1 599 examples. 2. California Housing (Torgo 2023). The dataset contains information from the 1990 Californian census. There are 8 numerical characteristics and one categorical feature, proximity to the ocean. For the reasons outlined before, we decided to drop this feature and use a modified dataset. The task is to predict the median house value. The dataset contains 20 640 examples and has 8 features. 3. Parkinson Telemonitoring Data (Tsanas and Little 2009). The dataset contains 5 875 voice measurements from Parkinson's disease patients, collected at home. It includes 17 numerical features, after dropping 3 categorical columns (ID, age), with the task of predicting UPDRS motor and total scores.
Dataset Splits | Yes | In each case, the dataset was split 80:20 into training and test sets.
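An 80:20 split of this kind is commonly done with scikit-learn; the sketch below uses synthetic placeholder data shaped like the Red Wine Quality dataset (1 599 rows, 11 features) rather than the real files, and the random seed is an arbitrary assumption.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic placeholder standing in for a real tabular dataset
# (same shape as Red Wine Quality: 1 599 examples, 11 features).
rng = np.random.default_rng(42)
X = rng.normal(size=(1599, 11))
y = rng.uniform(1, 10, size=1599)

# 80:20 train/test split, as described in the paper.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

print(len(X_train), len(X_test))
```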
Hardware Specification | Yes | The computations were carried out on a Format Server THOR E221 (Supermicro) server equipped with two AMD EPYC 7702 64-core processors and 512 GB of RAM, running Ubuntu 22.04.1 LTS.
Software Dependencies | No | The paper mentions software such as Python, C++, numpy.float32, and XGBoost, but does not specify version numbers.
Experiment Setup | Yes | Model type m: recall that, in our case, there were two model types for a fixed dataset. The standard deviation σ of the Gaussian used to perturb a feature: we used the values {0.1, 0.3, 1.0}. For each number of iterations i ∈ {100, 500, 1000, 2000, 4000, 6000, 8000, 10000, 15000, 20000, 25000, 30000, 35000}, we ran our closed-form algorithm and the sampling method in question with iteration count i, both estimating the same value PG²(x, S) over N = 20 000 random pairs (x, S).