T-Cal: An Optimal Test for the Calibration of Predictive Models
Authors: Donghwan Lee, Xinmeng Huang, Hamed Hassani, Edgar Dobriban
JMLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We verify our theoretical findings with a broad range of experiments, including with several popular deep neural net architectures and several standard post-hoc calibration methods. T-Cal is a practical general-purpose tool, which combined with classical tests for discrete-valued predictors can be used to test the calibration of virtually any probabilistic classification method. ... We support our theoretical results with a broad range of experiments. We provide simulations, which support our theoretical optimality results. We also provide experiments with several popular deep neural net architectures (ResNet-50, VGG-19, DenseNet-121, etc.), on benchmark datasets (CIFAR-10 and CIFAR-100, ImageNet) and several standard post-hoc calibration methods (Platt scaling, histogram binning, isotonic regression, etc.). |
| Researcher Affiliation | Academia | Donghwan Lee EMAIL Graduate Group in Applied Mathematics and Computational Science University of Pennsylvania Philadelphia, PA 19104-6340, USA; Xinmeng Huang EMAIL Graduate Group in Applied Mathematics and Computational Science University of Pennsylvania Philadelphia, PA 19104-6340, USA; Hamed Hassani EMAIL Department of Electrical and Systems Engineering University of Pennsylvania Philadelphia, PA 19104-6340, USA; Edgar Dobriban EMAIL Department of Statistics and Data Science University of Pennsylvania Philadelphia, PA 19104-6340, USA |
| Pseudocode | Yes | Algorithm 1 T-Cal: an optimal test for calibration (based on debiased plug-in estimation of the calibration error); Algorithm 2 Adaptive T-Cal: an adaptive test for calibration; Algorithm 3 Sample splitting calibration test ξ_n^split |
| Open Source Code | Yes | T-Cal is available at https://github.com/dh7401/T-Cal. Our numerical results can be reproduced with code available at https://github.com/dh7401/T-Cal. |
| Open Datasets | Yes | We also provide experiments with several popular deep neural net architectures (ResNet-50, VGG-19, DenseNet-121, etc.), on benchmark datasets (CIFAR-10 and CIFAR-100, ImageNet) and several standard post-hoc calibration methods (Platt scaling, histogram binning, isotonic regression, etc.). |
| Dataset Splits | Yes | To this end, we split the original dataset of 10,000 images into 2 sets of sizes 2,000 and 8,000. The first set is used to calibrate the model, and the second is used to perform adaptive T-Cal and calculate the empirical ℓ1-ECE. ... The test set provided by CIFAR-100 is split into two parts, containing 2,000 and 8,000 images, respectively. ... We split the validation set of 50,000 images into a calibration set and a test set of sizes 10,000 and 40,000, respectively. |
| Hardware Specification | No | The paper does not explicitly describe any specific hardware (like GPU models, CPU models, or memory specifications) used for running its experiments. It mentions neural network architectures and datasets, but not the underlying computational resources. |
| Software Dependencies | No | The paper mentions 'PyTorch' and 'torchvision package' but does not provide specific version numbers for these or any other software dependencies, which are required for a reproducible description. |
| Experiment Setup | Yes | In polynomial scaling, we use polynomials of order 3 to do regression on all the prediction-label pairs (Zi, Yi), and truncate the calibrated prediction values into the interval [0, 1]. We set the binning scheme in both histogram binning and scaling binning as 15 equal-mass bins. ... we set the polynomial degree as five in polynomial scaling. |
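The split-and-bin recipe quoted in the rows above (a 2,000/8,000 calibration/test split, 15 equal-mass bins, empirical ℓ1-ECE) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' T-Cal implementation; the function names and the synthetic data are hypothetical:

```python
import numpy as np


def ell1_ece(confidences, correct, num_bins=15):
    """Plug-in empirical l1-ECE with equal-mass binning (illustrative sketch).

    Bin edges are confidence quantiles, so each bin holds roughly the
    same number of points, as in the paper's 15-equal-mass-bins setup.
    """
    edges = np.quantile(confidences, np.linspace(0.0, 1.0, num_bins + 1))
    # Interior edges only; digitize maps each point to a bin id 0..num_bins-1.
    bin_ids = np.digitize(confidences, edges[1:-1])
    ece = 0.0
    for b in range(num_bins):
        mask = bin_ids == b
        if not mask.any():
            continue
        # Weighted |accuracy - average confidence| gap within the bin.
        ece += mask.mean() * abs(correct[mask].mean() - confidences[mask].mean())
    return ece


# Synthetic, well-calibrated predictions: P(correct | confidence c) = c.
rng = np.random.default_rng(0)
n = 10_000
conf = rng.uniform(0.5, 1.0, size=n)
correct = rng.uniform(size=n) < conf

# Mimic the paper's split: 2,000 points for calibration, 8,000 for testing.
calib_conf, test_conf = conf[:2_000], conf[2_000:]
calib_correct, test_correct = correct[:2_000], correct[2_000:]

ece = ell1_ece(test_conf, test_correct)
```

Because the synthetic predictor is calibrated by construction, the estimated ℓ1-ECE on the 8,000-point test split should be close to zero (it does not vanish exactly due to finite-sample noise, which is precisely the issue T-Cal's debiasing addresses).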