Explanation Shift: How Did the Distribution Shift Impact the Model?
Authors: Carlos Mougan, Klaus Broelemann, Gjergji Kasneci, Thanassis Tiropanis, Steffen Staab
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We provide theoretical and experimental evidence and demonstrate the effectiveness of our approach on synthetic and real data. Additionally, we release an open-source Python package, skshift, which implements our method and provides usage tutorials for further reproducibility. |
| Researcher Affiliation | Collaboration | Carlos Mougan (AI Office, European Commission & University of Southampton); Klaus Broelemann (Schufa Holding AG, Germany); Gjergji Kasneci (Schufa Holding AG & Technical University of Munich); Thanassis Tiropanis (University of Southampton); Steffen Staab (University of Stuttgart & University of Southampton) |
| Pseudocode | No | The paper describes methods and uses mathematical formulations but does not contain a clearly labeled pseudocode or algorithm block. |
| Open Source Code | Yes | Additionally, we release an open-source Python package, skshift, which implements our method and provides usage tutorials for further reproducibility. To ensure reproducibility, we make the data, code repositories, and experiments publicly available at https://github.com/cmougan/ExplanationShift. The open-source Python package skshift is available at https://skshift.readthedocs.io/ |
| Open Datasets | Yes | In the main body of the paper we base our comparisons on the UCI Adult Income dataset Dua & Graff (2017) and on synthetic data. In the Appendix, we extend experiments to several other datasets, which confirm our findings: ACS Travel Time, ACS Employment, Stackoverflow dataset (Stackoverflow, 2019). |
| Dataset Splits | Yes | The model gψ is trained each time on each state using only the covariates D_X^new in the absence of the label, and a 50/50 random train-test split evaluates its performance. |
| Hardware Specification | Yes | Experiments were run on a 4-vCPU server with 32 GB RAM. |
| Software Dependencies | Yes | We used shap version 0.41.0 and lime version 0.2.0.1 as software packages. |
| Experiment Setup | Yes | We train fθ on D^{tr,ρ=0} using a gradient-boosted decision tree, while gψ : S(fθ, D_X^{val,ρ}) → {0, 1} is trained on datasets with different values of ρ; for gψ we use a logistic regression. In this experiment, we changed the hyperparameters of the original model: for the decision tree, we varied the depth of the tree; for the gradient-boosted decision trees, we changed the number of estimators; and for the random forest, both hyperparameters. |
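The pipeline quoted above (train fθ, compute explanations S(fθ, ·), then fit a detector gψ on a 50/50 split to separate validation from new-distribution explanations) can be sketched as follows. This is a minimal illustration, not the authors' skshift implementation: it uses a linear fθ so that SHAP values have the closed form φ_j = w_j · (x_j − E[x_j]) (Linear SHAP under feature independence), avoiding the shap dependency, and all data, feature names, and shift magnitudes are invented for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 2000

# Synthetic training data (stand-in for the paper's datasets).
X_tr = rng.normal(size=(n, 2))
y_tr = (X_tr[:, 0] + X_tr[:, 1] + 0.3 * rng.normal(size=n) > 0).astype(int)

# f_theta: here a linear model so explanations are exact and dependency-free.
f_theta = LogisticRegression().fit(X_tr, y_tr)
mu = X_tr.mean(axis=0)  # background expectation E[x_j]

def linear_shap(model, X, background_mean):
    """Exact SHAP values for a linear model (log-odds scale),
    assuming independent features: phi_j = w_j * (x_j - E[x_j])."""
    return model.coef_[0] * (X - background_mean)

def detector_auc(S_val, S_new, seed=0):
    """Fit g_psi (logistic regression) to distinguish explanation
    distributions; evaluate with a 50/50 random train-test split."""
    S = np.vstack([S_val, S_new])
    z = np.concatenate([np.zeros(len(S_val)), np.ones(len(S_new))])
    S_a, S_b, z_a, z_b = train_test_split(
        S, z, test_size=0.5, random_state=seed, stratify=z)
    g_psi = LogisticRegression().fit(S_a, z_a)
    return roc_auc_score(z_b, g_psi.predict_proba(S_b)[:, 1])

# Validation data from the training distribution, plus one shifted sample
# (feature 0 mean-shifted by +1.5 -- an arbitrary illustrative shift).
X_val = rng.normal(size=(n, 2))
X_new = rng.normal(size=(n, 2))
X_new[:, 0] += 1.5
X_ctrl = rng.normal(size=(n, 2))  # control: no shift

auc_shift = detector_auc(linear_shap(f_theta, X_val, mu),
                         linear_shap(f_theta, X_new, mu))
auc_ctrl = detector_auc(linear_shap(f_theta, X_val, mu),
                        linear_shap(f_theta, X_ctrl, mu))
print(f"AUC under shift: {auc_shift:.2f}, AUC without shift: {auc_ctrl:.2f}")
```

Under the shift the detector's AUC rises well above 0.5, while on the unshifted control it stays near 0.5, which is the signal the paper's gψ exploits. The released skshift package and the repository linked above provide the authors' actual implementation and tutorials.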