reproducibilityindex.ai

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Feature Responsiveness Scores: Model-Agnostic Explanations for Recourse

Authors: Seung Hyun Cheon, Anneke Wernerfelt, Sorelle Friedler, Berk Ustun

ICLR 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We conduct an extensive empirical study on the responsiveness of explanations in lending. Our results show that standard practices in consumer finance can backfire by presenting consumers with reasons without recourse, and demonstrate how our approach improves consumer protection by highlighting responsive features and identifying fixed predictions.
Researcher Affiliation	Academia	Seung Hyun Cheon (UC San Diego); Anneke Wernerfelt (Haverford College); Sorelle A. Friedler (Haverford College); Berk Ustun (UC San Diego)
Pseudocode	Yes	Algorithm 1: Sample Reachable Points; Algorithm 2: Enumerate Reachable Points
Open Source Code	Yes	We include a Python library to compute feature responsiveness scores available on GitHub.
Open Datasets	Yes	We work with three publicly available consumer finance classification datasets. ... heloc: n = 5,842, d = 43, FICO [23] ... german: n = 1,000, d = 36, Dua & Graff [15] ... givemecredit: n = 120,268, d = 23, Kaggle [32]
Dataset Splits	Yes	We split each dataset into a training sample (80%; to train models and tune parameters) and a test sample (20%; to evaluate out-of-sample performance).
Hardware Specification	No	The paper does not explicitly describe the hardware used to run its experiments.
Software Dependencies	No	The paper mentions using a 'Python library' and various machine learning models (Logistic Regression, XGBoost, Random Forests, SHAP, LIME) but does not provide specific version numbers for any of these software components.
Experiment Setup	Yes	We fit models using (1) logistic regression (LR), (2) XGBoost (XGB), and (3) random forests (RF). For each model, we construct feature-highlighting explanations for each person who is denied credit that highlight up to four features... We chose the sample size N = 500 to ensure that the 100(1 − α)% confidence interval in Appendix A.2 had an upper bound 0.01 when μ̂_j(x) = 0 with α = 0.01.
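
The 80%/20% split and the sample-size choice quoted in the table can be illustrated with a minimal Python sketch. It is not the authors' code. The function names `train_test_indices` and `zero_hit_upper_bound` are hypothetical, the seed is arbitrary, and the interval in Appendix A.2 is not reproduced on this page, so the sketch assumes a one-sided exact binomial bound for the case where none of the N sampled points flips the prediction. The paper's interval may be defined differently.

```python
import math
import random


def train_test_indices(n_rows, train_fraction=0.8, seed=0):
    """Shuffle row indices and split them into train and test lists.

    Hypothetical helper: the seed and the rounding rule are assumptions.
    """
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    cut = int(round(train_fraction * n_rows))
    return indices[:cut], indices[cut:]


def zero_hit_upper_bound(n_samples, alpha):
    """One-sided exact binomial upper bound when 0 of n_samples are hits.

    Assumed form (not taken from the paper): solve (1 - p)^n_samples = alpha
    for p, which gives p = 1 - alpha^(1 / n_samples).
    """
    return 1.0 - alpha ** (1.0 / n_samples)


# The german dataset has n = 1,000 rows, so an 80/20 split gives 800 and 200.
train_idx, test_idx = train_test_indices(1000)
print(len(train_idx), len(test_idx))

# With N = 500 and alpha = 0.01, the assumed bound falls just below 0.01.
bound = zero_hit_upper_bound(500, 0.01)
print(math.isclose(bound, 0.00917, abs_tol=1e-4), bound < 0.01)
```

Under this assumed bound, N = 500 is close to the smallest sample size that keeps the upper limit under 0.01 at α = 0.01, which is consistent with the choice reported in the Experiment Setup row.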