Calibrated Selective Classification

Authors: Adam Fisch, Tommi S. Jaakkola, Regina Barzilay

TMLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate the empirical effectiveness of our approach on multiple image classification and lung cancer risk assessment tasks... We demonstrate consistent empirical reductions in selective calibration error metrics... across multiple tasks and datasets.
Researcher Affiliation | Academia | Adam Fisch, Tommi Jaakkola, and Regina Barzilay; Computer Science and Artificial Intelligence Laboratory (CSAIL), Massachusetts Institute of Technology, Cambridge, MA 02142, USA.
Pseudocode | Yes | Algorithm 1: Robust training for calibrated selective classification.
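For context on what Algorithm 1 ultimately produces, the following is a minimal sketch of the generic predict-or-abstain rule of a selective classifier: a base model f supplies class probabilities, and a selector g supplies confidence scores used to decide whether to keep or reject each prediction. The function name, threshold, and toy values are illustrative assumptions, not the paper's code.

```python
import numpy as np

def selective_predict(probs, confidence, threshold):
    """Predict-or-abstain rule of a selective classifier (sketch).

    probs: (n, k) class probabilities from the base model f.
    confidence: (n,) scores from a selector g (higher = keep).
    threshold: abstain when confidence < threshold.
    Returns the predicted class per example, or -1 for abstentions.
    """
    preds = probs.argmax(axis=1)
    preds[confidence < threshold] = -1  # abstain on low-confidence inputs
    return preds

# Toy usage: two kept predictions, one abstention.
p = np.array([[0.9, 0.1], [0.2, 0.8], [0.55, 0.45]])
g = np.array([0.95, 0.9, 0.3])
print(selective_predict(p, g, threshold=0.5))  # -> [ 0  1 -1]
```

Calibrated selective classification then asks that the kept predictions be well calibrated, which is what training g in Algorithm 1 targets.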
Open Source Code | Yes | Our code is available at https://github.com/ajfisch/calibrated-selective-classification.
Open Datasets | Yes | The CIFAR-10 dataset (Krizhevsky, 2012)... The ImageNet dataset (Deng et al., 2009)... data and models from Mikhael et al. (2022), all of which was subject to IRB approval (including usage for this study). The base model for f is a 3D CNN trained on scans from the National Lung Screening Trial (NLST) data (Aberle et al., 2011).
Dataset Splits | Yes | The CIFAR-10 dataset... has 50k images for training and 10k for testing. We remove 5k images each from the training set for validation and perturbation datasets for training g... We then use a separate split of 6,282 scans from the NLST data to train g... We evaluate our selective classifier on a new set of 1,337 scans obtained from Massachusetts General Hospital (MGH).
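The CIFAR-10 split described above (carving 5k validation and 5k perturbation examples out of the 50k training images) can be sketched as a simple index partition. The sizes come from the text; the seed and shuffling scheme are illustrative assumptions.

```python
import numpy as np

def split_cifar10_indices(n_train=50_000, n_val=5_000, n_pert=5_000, seed=0):
    """Partition the CIFAR-10 training indices into train / validation /
    perturbation subsets (sizes from the paper; seed is illustrative)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_train)
    val = idx[:n_val]                      # 5k held out for validation
    pert = idx[n_val:n_val + n_pert]       # 5k held out for perturbations
    train = idx[n_val + n_pert:]           # remaining 40k for training
    return train, val, pert

train, val, pert = split_cifar10_indices()
print(len(train), len(val), len(pert))  # -> 40000 5000 5000
```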
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory amounts) used for running its experiments.
Software Dependencies | No | All models are trained in PyTorch (Paszke et al., 2019), while high-level input features (§4.5) for g are computed with sklearn (Pedregosa et al., 2011). The paper mentions software but does not specify version numbers for PyTorch or scikit-learn.
Experiment Setup | Yes | We compute the S-MMCE with samples of size 1024 (of examples with the same perturbation t ∈ T applied), with a combined batch size of m = 32. We train all models for 5 epochs, with 50k samples (where one sample is a batch D of 1024 perturbed examples) per epoch (≈7.8k updates total at the combined batch size of 32). The loss hyper-parameters λ1 and λ2 were set to |D|^(-1/2) = 1/32 and 1e-2 · |D|^(-1) ≈ 1e-5, respectively. The κ in our top-κ calibration error loss is set to 4.
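The quoted hyper-parameters can be checked arithmetically: with |D| = 1024, the stated weights λ1 = |D|^(-1/2) and λ2 = 1e-2 · |D|^(-1) come out to 1/32 and about 1e-5, and 5 epochs of 50k samples at a combined batch size of 32 give roughly 7.8k updates. A minimal sketch of this bookkeeping (the variable names are ours; the loss itself is not reproduced here):

```python
# Derive the stated loss weights and update count from the setup quoted
# in the row above. Only arithmetic; the S-MMCE loss is not implemented.
D = 1024                     # examples per perturbed batch ("sample")
m = 32                       # samples combined per gradient update
lam1 = D ** -0.5             # = 1/32, weight on the first loss term
lam2 = 1e-2 * D ** -1        # ~1e-5, weight on the second loss term
kappa = 4                    # top-kappa calibration error loss

epochs, samples_per_epoch = 5, 50_000
updates = epochs * samples_per_epoch // m  # ~7.8k total updates
print(lam1, lam2, updates)
```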