Explaining Confident Black-Box Predictions

Authors: Evan Yao, Retsef Levi, Assaf Avrahami, Abraham Meidan

TMLR 2025

Reproducibility: Variable | Result | LLM Response
Research Type | Experimental | This approach is evaluated on 6 real-world datasets in application areas ranging from healthcare to criminal justice and finance. Empirical results suggest that this methodology finds rule lists of length at most 5 with ABBR within 7.4% of the optimal ABBR of any explanation, while checklists provide greater interpretability for a small cost in performance. Section 5 (Empirical Results): This section presents empirical results for Rule List Search and Checklist Search on the 6 real-world datasets for binary classification shown in Table 1. Our results are organized into two subsections. Section 5.1: Experiments that show our Rule List Search and Checklist Search algorithms generate strong rules that achieve near-optimal ABBR on all 6 datasets. Ablation studies are performed to explain the value of higher-order rules, longer rules, and interpretability.
Researcher Affiliation | Collaboration | Evan Yao (EMAIL), Operations Research Center, Massachusetts Institute of Technology; Retsef Levi (EMAIL), Sloan School of Management, Massachusetts Institute of Technology; Assaf Avrahami (EMAIL), Wizsoft; Abraham Meidan (EMAIL), Wizsoft
Pseudocode | Yes | Reproducibility. Code and datasets will be released on GitHub upon acceptance. Pseudo-code is available in Appendix C.
Open Source Code | No | Reproducibility. Code and datasets will be released on GitHub upon acceptance.
Open Datasets | Yes | Table 1: Overview of Datasets used in Empirical Experiments. This table shows the number of rows N, features D, examples of some features, the target outcome of interest, and the AUC of the black-box Random Forest Classifier we seek to explain. Note that datasets with N = 30,000 have been sampled from a larger dataset to reduce computational time. Sources for the 6 datasets are as follows: Dua & Graff (2019a), Dua & Graff (2019b), FICO (2018), ProPublica (2016), National Center for Biotechnology Information (2024), Dua & Graff (2019c).
Dataset Splits | Yes | Experiments were run on MIT's Engage Compute Cluster over 100 train/test splits, taking around 1 hour of compute time when parallelized with 24 cores. Each instance of Rule List Search or Checklist Search takes no more than 10 seconds to run on the 6 datasets in Table 1. Results shown are the average of the test performance across the 100 train/test splits.
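The evaluation protocol above (repeated train/test splits, reporting the average test performance) can be sketched as follows. This is a minimal illustration, not the authors' code: the synthetic dataset, the Random Forest stand-in for the explained model, and the use of 10 splits instead of 100 are all assumptions made to keep the example small and fast.

```python
# Hypothetical sketch of averaging test metrics over repeated
# train/test splits, as described in the evaluation protocol.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Stand-in binary classification dataset (the paper uses 6 real datasets).
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

scores = []
for seed in range(10):  # the paper uses 100 splits; 10 keeps this quick
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, random_state=seed)
    clf = RandomForestClassifier(n_estimators=50, random_state=seed)
    clf.fit(X_tr, y_tr)
    # Evaluate on the held-out split; one score per split.
    scores.append(roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))

# Reported result: the mean of the per-split test scores.
print(f"mean test AUC over {len(scores)} splits: {np.mean(scores):.3f}")
```

Varying `random_state` per iteration is what makes each train/test split independent; averaging over splits reduces the variance of the reported metric.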
Hardware Specification | No | Experiments were run on MIT's Engage Compute Cluster over 100 train/test splits, taking around 1 hour of compute time when parallelized with 24 cores. The cluster's hardware (CPU model, memory) is not otherwise specified.
Software Dependencies | No | The paper mentions "WizWhy (Meidan, 2005)", the "Apriori Algorithm (Agrawal et al., 1996)", and a "Random Forest Regressor", but does not provide specific version numbers for any software libraries or tools used in the implementation.
Experiment Setup | Yes | Setup. Rule lists are generated with L ∈ {2, 3, 4} (maximum order of each rule) and M ∈ {3, 5} (maximum number of rules in the rule list). Checklists are generated with K ∈ {5, 7} (number of conditions). The target support s is chosen from {0.1, 0.2}. The black-box predictions {b(X_n)}_{n=1}^N are generated from a Random Forest Classifier trained with 500 estimators.
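The setup above can be sketched in code: train the 500-estimator Random Forest as the black-box model, collect its predictions {b(X_n)}, and enumerate the hyperparameter grid over L, M, K, and s. The synthetic dataset is an assumption, and the rule-search step itself (which would consume `b` and one grid configuration) is omitted, since the paper's algorithm is not reproduced here.

```python
# Hypothetical sketch of the experimental setup: a Random Forest
# black-box (500 estimators) and the hyperparameter grid described
# in the paper. Dataset and feature space are stand-ins.
from itertools import product

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

black_box = RandomForestClassifier(n_estimators=500, random_state=0)
black_box.fit(X, y)
b = black_box.predict(X)  # black-box predictions {b(X_n)} to be explained

# Hyperparameter grid from the setup: max rule order L, max number of
# rules M, checklist size K, and target support s.
grid = list(product([2, 3, 4],    # L
                    [3, 5],       # M
                    [5, 7],       # K
                    [0.1, 0.2]))  # s
print(len(grid))  # 3 * 2 * 2 * 2 = 24 configurations
```

Each grid entry would parameterize one run of Rule List Search or Checklist Search against the fixed predictions `b`.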