Deep Neural Network Benchmarks for Selective Classification
Authors: Andrea Pugnana, Lorenzo Perini, Jesse Davis, Salvatore Ruggieri
DMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We fill this gap by benchmarking 18 baselines on a diverse set of 44 datasets that includes both image and tabular data. Moreover, there is a mix of binary and multiclass tasks. We evaluate these approaches using several criteria, including selective error rate, empirical coverage, distribution of rejected instances' classes, and performance on out-of-distribution instances. The results indicate that there is not a single clear winner among the surveyed baselines, and the best method depends on the user's objectives. |
| Researcher Affiliation | Academia | Andrea Pugnana EMAIL Scuola Normale Superiore, University of Pisa, ISTI-CNR Pisa, Italy Lorenzo Perini EMAIL KU Leuven Leuven, Belgium Jesse Davis EMAIL KU Leuven Leuven, Belgium Salvatore Ruggieri EMAIL University of Pisa Pisa, Italy |
| Pseudocode | No | The paper describes various algorithms and architectures (e.g., Learn-to-Abstain, Learn-to-Select, Score-based methods) using mathematical formulas and architectural diagrams (Figures 1, 2, 3), but it does not contain any explicit pseudocode blocks or algorithms with numbered steps. |
| Open Source Code | Yes | (iv) release a public repository with all software code and datasets for reproducing the baseline algorithms and the experiments. The code is available at github.com/andrepugni/ESC/. |
| Open Datasets | Yes | We run experiments on 44 benchmark datasets from real-life scenarios, such as finance and healthcare (Yang et al., 2023). Among these, 20 are image data and 24 are tabular data. Moreover, 13 of these datasets were previously used in testing (at least one) the baselines in their original paper. Details are provided in Tables A1-A2 of the Appendix A.1. |
| Dataset Splits | Yes | For each combination of datasets and baselines, we run the following experiment: (i) we randomly split the available data into training, calibration, validation, and test sets using the proportion 60/10/10/20% |
| Hardware Specification | Yes | Regarding computational resources, we split the workload over three machines: (1) a 25 nodes cluster equipped with 2 16-core @ 2.7 GHz (3.3 GHz Turbo) POWER9 Processor and 4 NVIDIA Tesla V100 each, OS Red Hat Enterprise Linux release 8.4; (2) a 96 cores machine with Intel(R) Xeon(R) Gold 6342 CPU @ 2.80GHz and two NVIDIA RTX A6000, OS Ubuntu 20.04.4; (3) a 128 cores machine with AMD EPYC 7502 32-Core Processor and four NVIDIA RTX A5000, OS Ubuntu 20.04.6. |
| Software Dependencies | No | We optimize the hyperparameters using optuna (Akiba et al., 2019), a framework for multi-objective Bayesian optimization, with the following inputs: coverage violation and cross-entropy loss as target metrics, BoTorch as sampler (Balandat et al., 2020), 10 initial independent trials out of 20 total trials. The paper mentions Optuna and BoTorch but does not provide specific version numbers for these or any other software components. |
| Experiment Setup | Yes | All networks are trained for 300 epochs. We optimize the hyperparameters using optuna (Akiba et al., 2019), a framework for multi-objective Bayesian optimization, with the following inputs: coverage violation and cross-entropy loss as target metrics, BoTorch as sampler (Balandat et al., 2020), 10 initial independent trials out of 20 total trials. Among the 20 trials, we select the configuration that (1) has the highest accuracy on the validation set and (2) reaches the target coverage (±0.05). Moreover, some baselines require the target coverage c to be known at training time (e.g., SELNET). For the sake of reducing the computational cost, we optimize their hyperparameters using only three values c ∈ {.99, .85, .70} and fix the best-performing architecture for all target coverages. Moreover, SCROSS, AUCROSS, ENS, ENS+SR and PLUGINAUC use the same optimal hyperparameters found for SR as they share the same training loss. Similarly, SELNET+SR, SELNET+EM+SR, SAT+SR and SAT+EM+SR employ the same optimal configuration as, respectively, SELNET, SELNET+EM, SAT and SAT+EM. We detail the parameter choices in Appendix A.2. |
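The paper's core evaluation criteria, selective error rate and empirical coverage, follow the standard selective-classification definitions: coverage is the fraction of instances the model accepts, and selective error is the error rate computed only over accepted instances. A minimal stdlib sketch of these two metrics (the function name and input representation are illustrative assumptions, not the paper's code):

```python
def selective_metrics(y_true, y_pred, accepted):
    """Compute (empirical coverage, selective error rate).

    coverage       = |accepted instances| / |all instances|
    selective err  = error rate restricted to accepted instances

    These are the standard definitions; the paper's exact
    implementation may differ in detail.
    """
    n = len(y_true)
    acc_idx = [i for i in range(n) if accepted[i]]
    coverage = len(acc_idx) / n
    if not acc_idx:                      # everything rejected
        return coverage, 0.0
    errors = sum(1 for i in acc_idx if y_pred[i] != y_true[i])
    return coverage, errors / len(acc_idx)

cov, err = selective_metrics([0, 1, 1, 0], [0, 1, 0, 1],
                             [True, True, True, False])
# cov = 0.75 (3 of 4 accepted); err = 1/3 (1 error among the 3 accepted)
```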
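The random 60/10/10/20% train/calibration/validation/test split described under "Dataset Splits" can be sketched with stdlib Python; the function name, seeding scheme, and rounding convention below are illustrative assumptions, not the repository's code:

```python
import random

def split_indices(n, seed=0):
    """Shuffle n example indices and split them into train/
    calibration/validation/test with the 60/10/10/20% proportions
    used in the benchmark. Sizes round down except for the test
    set, which absorbs any remainder."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_tr, n_cal, n_val = int(0.6 * n), int(0.1 * n), int(0.1 * n)
    train = idx[:n_tr]
    cal = idx[n_tr:n_tr + n_cal]
    val = idx[n_tr + n_cal:n_tr + n_cal + n_val]
    test = idx[n_tr + n_cal + n_val:]
    return train, cal, val, test

train, cal, val, test = split_indices(1000)
# 600 / 100 / 100 / 200 disjoint indices
```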
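The configuration-selection rule quoted under "Experiment Setup" (keep only trials whose coverage is within the ±0.05 tolerance of the target, then pick the one with the highest validation accuracy) can be sketched as follows. The tuple representation of a trial is a hypothetical simplification for illustration, not Optuna's API:

```python
def select_config(trials, target_coverage, tol=0.05):
    """Select the best hyperparameter trial: filter to trials whose
    empirical coverage lies within `tol` of `target_coverage`, then
    return the config with the highest validation accuracy.
    `trials` is a list of (config, coverage, val_accuracy) tuples."""
    feasible = [t for t in trials
                if abs(t[1] - target_coverage) <= tol]
    if not feasible:
        return None
    return max(feasible, key=lambda t: t[2])[0]

trials = [("cfg_a", 0.72, 0.91),   # coverage too far from target 0.85
          ("cfg_b", 0.84, 0.88),
          ("cfg_c", 0.87, 0.90)]
best = select_config(trials, target_coverage=0.85)
# best == "cfg_c": within tolerance, highest validation accuracy
```

In the paper's pipeline this selection is applied to the 20 Optuna trials (10 of which are independent start-up trials for the BoTorch sampler); the sketch above only captures the final filter-then-argmax step.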