Calibrated Selective Classification
Authors: Adam Fisch, Tommi S. Jaakkola, Regina Barzilay
TMLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the empirical effectiveness of our approach on multiple image classification and lung cancer risk assessment tasks. ... We demonstrate consistent empirical reductions in selective calibration error metrics... across multiple tasks and datasets. |
| Researcher Affiliation | Academia | Adam Fisch EMAIL Tommi Jaakkola EMAIL Regina Barzilay EMAIL Computer Science and Artificial Intelligence Laboratory (CSAIL) Massachusetts Institute of Technology, Cambridge, MA, 02142, USA. |
| Pseudocode | Yes | Algorithm 1 Robust training for calibrated selective classification. |
| Open Source Code | Yes | Our code is available at https://github.com/ajfisch/calibrated-selective-classification. |
| Open Datasets | Yes | The CIFAR-10 dataset (Krizhevsky, 2012)... The ImageNet dataset (Deng et al., 2009)... data and models from Mikhael et al. (2022), all of which was subject to IRB approval (including usage for this study). The base model for f is a 3D CNN trained on scans from the National Lung Screening Trial (NLST) data (Aberle et al., 2011). |
| Dataset Splits | Yes | The CIFAR-10 dataset... has 50k images for training and 10k for testing. We remove 5k images each from the training set for validation and perturbation datasets for training g... We then use a separate split of 6,282 scans from the NLST data to train g... We evaluate our selective classifier on a new set of 1,337 scans obtained from Massachusetts General Hospital (MGH). |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory amounts) used for running its experiments. |
| Software Dependencies | No | All models are trained in PyTorch (Paszke et al., 2019), while high-level input features (§4.5) for g are computed with sklearn (Pedregosa et al., 2011). The paper mentions software but does not specify version numbers for PyTorch or scikit-learn. |
| Experiment Setup | Yes | We compute the S-MMCE with samples of size 1024 (of examples with the same perturbation t ∈ T applied), with a combined batch size of m = 32. We train all models for 5 epochs, with 50k samples (where one sample is a batch D of 1024 perturbed examples) per epoch (≈7.8k updates total at the combined batch size of 32). The loss hyper-parameters λ1 and λ2 were set to \|D\|^−0.5 = 1/32 and 1e-2 · \|D\|^−1 ≈ 1e-5, respectively. The κ in our top-κ calibration error loss is set to 4. |
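The hyper-parameter values in the Experiment Setup row are internally consistent and can be checked with a few lines of arithmetic. The sketch below is illustrative only; the variable names (`sample_size`, `lambda1`, etc.) are my own and do not come from the authors' released code.

```python
# Reported training configuration (values quoted from the table above).
sample_size = 1024        # |D|: perturbed examples per sampled batch D
batch_size = 32           # m: combined batch size
epochs = 5
samples_per_epoch = 50_000
top_k = 4                 # kappa in the top-kappa calibration error loss

# Loss weights as reported: lambda1 = |D|^-0.5, lambda2 = 1e-2 * |D|^-1.
lambda1 = sample_size ** -0.5          # 1/32, matching the quoted value
lambda2 = 1e-2 * sample_size ** -1     # ~1e-5, matching the quoted value

# Total gradient updates: 5 epochs x 50k samples / combined batch of 32.
updates_total = epochs * samples_per_epoch // batch_size  # ~7.8k updates
```

This confirms that the quoted λ1 = 1/32 and λ2 ≈ 1e-5 follow directly from |D| = 1024, and that 5 × 50k / 32 ≈ 7.8k updates.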