Regularizing Black-box Models for Improved Interpretability

Authors: Gregory Plumb, Maruan Al-Shedivat, Ángel Alexander Cabrera, Adam Perer, Eric Xing, Ameet Talwalkar

NeurIPS 2020

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | "We demonstrate that post-hoc explanations for EXPO-regularized models have better explanation quality, as measured by the common fidelity and stability metrics. We verify that improving these metrics leads to significantly more useful explanations with a user study on a realistic task." (Section 4, Experimental Results) |
| Researcher Affiliation | Collaboration | Gregory Plumb (Carnegie Mellon University), Maruan Al-Shedivat (Carnegie Mellon University), Ángel Alexander Cabrera (Carnegie Mellon University), Adam Perer (Carnegie Mellon University), Eric Xing (CMU, Petuum Inc.), Ameet Talwalkar (CMU, Determined AI) |
| Pseudocode | Yes | Algorithm 1: Neighborhood-fidelity regularizer |
| Open Source Code | Yes | https://github.com/GDPlumb/ExpO |
| Open Datasets | Yes | "seven regression problems from the UCI collection [Dheeru and Karra Taniskidou, 2017], the MSD dataset, and Support2, which is an in-hospital mortality classification problem. Dataset statistics are in Table 2." As in [Bloniarz et al., 2016], the MSD dataset is treated as a regression problem with the goal of predicting the release year of a song. Support2: http://biostat.mc.vanderbilt.edu/wiki/Main/SupportDesc |
| Dataset Splits | No | The paper mentions evaluating on test data and discusses parameters for the neighborhoods (N_x and N_x^reg), but it does not provide explicit train/validation/test splits (percentages or counts) in the main text. |
| Hardware Specification | Yes | "Each model takes less than a few minutes to train on an Intel 8700k CPU, so computational cost was not a limiting factor in our experiments." |
| Software Dependencies | No | The paper mentions using SGD with Adam [Kingma and Ba, 2014] as the optimization algorithm, but it does not specify any software libraries, frameworks, or version numbers (e.g., Python, TensorFlow, PyTorch, scikit-learn). |
| Experiment Setup | Yes | "The network architectures and hyper-parameters are chosen using a grid search; for more details see Appendix A.3. For the final results, we set N_x to be N(x, σ) with σ = 0.1 and N_x^reg to be N(x, σ) with σ = 0.5." |
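To make the table's "Neighborhood-fidelity regularizer" entry concrete, here is a minimal NumPy sketch of the underlying idea: sample points in a neighborhood N(x, σ) around a query point, fit the best local linear surrogate of the black-box model by least squares, and measure the surrogate's residual error. The function name, sample count, and least-squares formulation are illustrative assumptions, not the authors' implementation (which regularizes a neural network during training; see Algorithm 1 in the paper and the linked repository).

```python
import numpy as np

def neighborhood_fidelity(f, x, sigma=0.1, n_samples=50, rng=None):
    """Hypothetical sketch of a neighborhood-fidelity penalty.

    Samples n_samples points from N(x, sigma), fits a local linear
    surrogate g(x') = w @ x' + b to the black-box f by least squares,
    and returns the surrogate's mean squared residual. A small value
    means f is locally well-approximated by a linear explanation.
    """
    rng = np.random.default_rng(rng)
    # Sample the neighborhood N(x, sigma) around the query point.
    Xn = x + sigma * rng.standard_normal((n_samples, x.shape[0]))
    yn = f(Xn)
    # Solve for the best local linear model (weights plus intercept).
    A = np.hstack([Xn, np.ones((n_samples, 1))])
    coef, *_ = np.linalg.lstsq(A, yn, rcond=None)
    # Mean squared residual of the linear fit = neighborhood infidelity.
    residual = yn - A @ coef
    return float(np.mean(residual ** 2))
```

For a model that is already linear the penalty is essentially zero, while a nonlinear model incurs a positive penalty; training with this term added to the loss would push the model toward being locally explainable, which is the paper's central claim.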