Explaining Confident Black-Box Predictions

Authors: Evan Yao, Retsef Levi, Assaf Avrahami, Abraham Meidan

TMLR 2025

Reproducibility: Variable | Result | LLM Response
Research Type | Experimental | This approach is evaluated on 6 real-world datasets in application areas ranging from healthcare to criminal justice and finance. Empirical results suggest that this methodology finds rule lists of length at most 5 with ABBR within 7.4% of the optimal ABBR of any explanation, while checklists provide greater interpretability for a small cost in performance. Section 5 (Empirical Results): This section presents empirical results for Rule List Search and Checklist Search on the 6 real-world datasets for binary classification shown in Table 1. Our results are organized into two subsections. Section 5.1: Experiments that show our Rule List Search and Checklist Search algorithms generate strong rules that achieve near-optimal ABBR on all 6 datasets. Ablation studies are performed to explain the value of higher-order rules, longer rules, and interpretability.
Researcher Affiliation | Collaboration | Evan Yao (EMAIL), Operations Research Center, Massachusetts Institute of Technology; Retsef Levi (EMAIL), Sloan School of Management, Massachusetts Institute of Technology; Assaf Avrahami (EMAIL), Wizsoft; Abraham Meidan (EMAIL), Wizsoft
Pseudocode | Yes | Reproducibility. Code and datasets will be released on GitHub upon acceptance. Pseudo-code is available in Appendix C.
Open Source Code | No | Reproducibility. Code and datasets will be released on GitHub upon acceptance.
Open Datasets | Yes | Table 1: Overview of Datasets used in Empirical Experiments. This table shows the number of rows N, features D, examples of some features, the target outcome of interest, and the AUC of the black-box Random Forest Classifier we seek to explain. Note that datasets with N = 30,000 have been sampled from a larger dataset to reduce computational time. Sources for the 6 datasets are as follows: Dua & Graff (2019a), Dua & Graff (2019b), FICO (2018), ProPublica (2016), National Center for Biotechnology Information (2024), Dua & Graff (2019c).
Dataset Splits | Yes | Experiments were run on MIT's Engage Compute Cluster over 100 train/test splits, taking around 1 hour of compute time when parallelized with 24 cores. Each instance of Rule List Search or Checklist Search takes no more than 10 seconds to run on the 6 datasets in Table 1. Results shown are the average of the test performance across the 100 train/test splits.
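The evaluation protocol above (repeated train/test splits, reporting the average test performance) can be sketched as follows. This is a minimal illustration, not the authors' code: the synthetic dataset, the Random Forest stand-in for the explained model, and the use of 10 splits instead of 100 are all assumptions made to keep the example small and fast.

```python
# Hypothetical sketch of averaging test metrics over repeated
# train/test splits, as described in the evaluation protocol.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Stand-in binary classification dataset (the paper uses 6 real datasets).
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

scores = []
for seed in range(10):  # the paper uses 100 splits; 10 keeps this quick
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, random_state=seed)
    clf = RandomForestClassifier(n_estimators=50, random_state=seed)
    clf.fit(X_tr, y_tr)
    # Evaluate on the held-out split; one score per split.
    scores.append(roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))

# Reported result: the mean of the per-split test scores.
print(f"mean test AUC over {len(scores)} splits: {np.mean(scores):.3f}")
```

Varying `random_state` per iteration is what makes each train/test split independent; averaging over splits reduces the variance of the reported metric.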
Hardware Specification | No | Experiments were run on MIT's Engage Compute Cluster over 100 train/test splits, taking around 1 hour of compute time when parallelized with 24 cores. The cluster's hardware (CPU model, memory) is not otherwise specified.
Software Dependencies | No | The paper mentions "WizWhy (Meidan, 2005)", the "Apriori Algorithm (Agrawal et al., 1996)", and a "Random Forest Regressor", but does not provide specific version numbers for any software libraries or tools used in the implementation.
Experiment Setup | Yes | Setup. Rule lists are generated with L ∈ {2, 3, 4} (maximum order of each rule) and M ∈ {3, 5} (maximum number of rules in the rule list). Checklists are generated with K ∈ {5, 7} (number of conditions). The target support s is chosen from {0.1, 0.2}. The black-box predictions {b(X_n)}_{n=1}^N are generated from a Random Forest Classifier trained with 500 estimators.
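The setup above can be sketched in code: train the 500-estimator Random Forest as the black-box model, collect its predictions {b(X_n)}, and enumerate the hyperparameter grid over L, M, K, and s. The synthetic dataset is an assumption, and the rule-search step itself (which would consume `b` and one grid configuration) is omitted, since the paper's algorithm is not reproduced here.

```python
# Hypothetical sketch of the experimental setup: a Random Forest
# black-box (500 estimators) and the hyperparameter grid described
# in the paper. Dataset and feature space are stand-ins.
from itertools import product

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

black_box = RandomForestClassifier(n_estimators=500, random_state=0)
black_box.fit(X, y)
b = black_box.predict(X)  # black-box predictions {b(X_n)} to be explained

# Hyperparameter grid from the setup: max rule order L, max number of
# rules M, checklist size K, and target support s.
grid = list(product([2, 3, 4],    # L
                    [3, 5],       # M
                    [5, 7],       # K
                    [0.1, 0.2]))  # s
print(len(grid))  # 3 * 2 * 2 * 2 = 24 configurations
```

Each grid entry would parameterize one run of Rule List Search or Checklist Search against the fixed predictions `b`.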