Implications of Model Indeterminacy for Explanations of Automated Decisions

Authors: Marc-Etienne Brunet, Ashton Anderson, Richard Zemel

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To explore the extent to which model indeterminacy may impact the consistency of explanations in a practical setting, we conduct a series of experiments.
Researcher Affiliation | Academia | Marc-Etienne Brunet (University of Toronto, Vector Institute); Ashton Anderson (University of Toronto, Vector Institute); Richard Zemel (University of Toronto, Columbia University, Vector Institute)
Pseudocode | No | The paper describes methods and mathematical formulations but does not contain a structured pseudocode or algorithm block, nor is there a section explicitly labeled "Pseudocode" or "Algorithm".
Open Source Code | No | Experimental source code will be made available at github.com/mebrunet/model-indeterminacy
Open Datasets | Yes | We use three different (binary) risk assessment datasets (all available on Kaggle): UCI Credit Card [35], Give Me Some Credit, and Porto Seguro's Safe Driver Prediction. Their details can be found in Appendix B.1.
Dataset Splits | Yes | We first split each dataset into a development and a holdout set (70 / 30), and apply one-hot encoding and standard scaling. We then run a model selection process with three model classes: logistic regression (LR), multi-layer perceptron (MLP), and a tabular ResNet (TRN) recently proposed by Gorishniy et al. [10]. We sweep through a range of hyperparameter settings, trying a total of 408 model-hyperparameter configurations per dataset. For each configuration, we pick a random seed and use it to control a shuffled split of the development dataset into train and validation sets (70 / 30). (A sketch of this split and preprocessing pipeline appears after the table.)
Hardware Specification | No | Our experiments were conducted on a GPU-accelerated computing cluster.
Software Dependencies | No | ML models were written in PyTorch [26], and the analysis used NumPy [12] and Matplotlib [13].
Experiment Setup | Yes | We sweep through a range of hyperparameter settings, trying a total of 408 model-hyperparameter configurations per dataset. For each configuration, we pick a random seed and use it to control a shuffled split of the development dataset into train and validation sets (70 / 30). This seed also controls the randomness used in training (optimization). We fit the models using Adam [15] with a patience-based stopping criterion on the validation set. We also up-weight the rare class, creating a balanced loss. We repeat this process with 3 random seeds per configuration, obtaining a total of 1224 model instances per dataset. (A sketch of this training procedure appears after the table.)
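
To make the split and preprocessing procedure quoted under Dataset Splits concrete, here is a minimal sketch. It is an illustration only: the file path, target column name, and seed value are hypothetical, and the paper's exact feature handling and 408-configuration model-selection sweep are not reproduced here.

```python
# Hedged sketch of the 70/30 development/holdout split, one-hot encoding and
# standard scaling, and the seeded 70/30 train/validation split of the
# development set. File path, target column, and seed are placeholders.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.read_csv("uci_credit_card.csv")  # hypothetical path to one of the Kaggle datasets
y = df.pop("target").values              # hypothetical binary target column
X = df

# One fixed 70/30 development / holdout split per dataset.
X_dev, X_hold, y_dev, y_hold = train_test_split(X, y, test_size=0.30, random_state=0)

# One-hot encode categorical columns, standard-scale the numeric ones.
cat_cols = X.select_dtypes(include="object").columns
num_cols = X.columns.difference(cat_cols)
preprocess = ColumnTransformer([
    ("onehot", OneHotEncoder(handle_unknown="ignore"), cat_cols),
    ("scale", StandardScaler(), num_cols),
])
X_dev = preprocess.fit_transform(X_dev)
X_hold = preprocess.transform(X_hold)

# Each model-hyperparameter configuration gets a random seed that controls a
# shuffled 70/30 train/validation split of the development set (the paper
# reuses the same seed for training randomness as well).
seed = 0  # placeholder; the paper uses 3 seeds per configuration
X_tr, X_val, y_tr, y_val = train_test_split(
    X_dev, y_dev, test_size=0.30, random_state=seed, shuffle=True
)
```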
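The training loop quoted under Experiment Setup (Adam, a balanced loss obtained by up-weighting the rare class, and patience-based stopping on the validation set) can be sketched as follows. The model, learning rate, patience, and full-batch updates are placeholder assumptions rather than the paper's hyperparameters; in the paper, the same per-configuration seed also controls this training randomness (e.g., initialization).

```python
# Hedged sketch of training with Adam, a class-weighted (balanced) BCE loss,
# and patience-based early stopping on the validation loss. Hyperparameters
# and the full-batch update are simplifying placeholders; inputs are assumed
# to already be torch tensors (e.g., via torch.from_numpy(...)).
import copy
import torch
import torch.nn as nn

def train(model, X_tr, y_tr, X_val, y_val, lr=1e-3, patience=10, max_epochs=200):
    # Up-weight the rare (positive) class so the loss is balanced.
    pos_weight = (y_tr == 0).float().sum() / (y_tr == 1).float().sum()
    loss_fn = nn.BCEWithLogitsLoss(pos_weight=pos_weight)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)

    best_val, best_state, since_best = float("inf"), None, 0
    for _ in range(max_epochs):
        model.train()
        optimizer.zero_grad()
        loss = loss_fn(model(X_tr).squeeze(-1), y_tr.float())
        loss.backward()
        optimizer.step()

        model.eval()
        with torch.no_grad():
            val_loss = loss_fn(model(X_val).squeeze(-1), y_val.float()).item()

        if val_loss < best_val:
            best_val, best_state, since_best = val_loss, copy.deepcopy(model.state_dict()), 0
        else:
            since_best += 1
            if since_best >= patience:  # patience-based stopping criterion
                break

    model.load_state_dict(best_state)
    return model
```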