Reassessing How to Compare and Improve the Calibration of Machine Learning Models
Authors: Muthu Chidambaram, Rong Ge
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments. To drive home the point, we revisit the experiments of Guo et al. (2017), but focus on one of the more modern architectural choices of Minderer et al. (2021). Namely, we evaluate the most downloaded pretrained vision transformer (ViT) model (Dosovitskiy et al., 2021; Steiner et al., 2021) available through the timm library (Wightman, 2019) with respect to binned ECE, binned ACE (Nixon et al., 2019), Smooth ECE (Błasiok & Nakkiran, 2023), negative log-likelihood, and MSE on ImageNet-1K-Val (Russakovsky et al., 2015). We split the data into 10,000 calibration samples (20% split) and 40,000 test samples, and compare the unmodified test performance to temperature scaling (TS), histogram binning (HB), isotonic regression (IR), and our proposed MRR. We follow the same TS implementation as Guo et al. (2017) and also use 15 bins for the binning estimators (ECE, ACE, HB) to be comparable to the results in their work. The results shown in Table 1 are not sensitive to the choice of model; we show similar results for several other popular timm models in Appendix F. |
| Researcher Affiliation | Academia | Muthu Chidambaram & Rong Ge Department of Computer Science, Duke University |
| Pseudocode | No | The paper describes algorithms and methods verbally and mathematically but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks, nor structured steps formatted like code. |
| Open Source Code | Yes | Code for our visualizations is available as the Python package sharpcal. |
| Open Datasets | Yes | Namely, we evaluate the most downloaded pretrained vision transformer (ViT) model (Dosovitskiy et al., 2021; Steiner et al., 2021) available through the timm library (Wightman, 2019) with respect to binned ECE, binned ACE (Nixon et al., 2019), Smooth ECE (Błasiok & Nakkiran, 2023), negative log-likelihood, and MSE on ImageNet-1K-Val (Russakovsky et al., 2015). F.2 CIFAR EXPERIMENTS We used ImageNet in the main paper due to the plethora of high quality, pretrained models available via open source libraries. To show that our observations extend to other datasets, however, we also include experiments on CIFAR-10 and CIFAR-100. |
| Dataset Splits | Yes | We split the data into 10,000 calibration samples (20% split) and 40,000 test samples, and compare the unmodified test performance to temperature scaling (TS), histogram binning (HB), isotonic regression (IR), and our proposed MRR. ...calibrate them using 20% of the CIFAR-10 and CIFAR-100 test sets to be consistent with our ImageNet experiments. |
| Hardware Specification | Yes | All of our experiments were done in PyTorch (Paszke et al., 2019) on a single A5000 GPU. |
| Software Dependencies | No | The paper mentions 'PyTorch (Paszke et al., 2019)' and the 'timm library', as well as a 'Python package sharpcal', but does not provide specific version numbers for any of these software components. |
| Experiment Setup | Yes | We split the data into 10,000 calibration samples (20% split) and 40,000 test samples, and compare the unmodified test performance to temperature scaling (TS), histogram binning (HB), isotonic regression (IR), and our proposed MRR. We follow the same TS implementation as Guo et al. (2017) and also use 15 bins for the binning estimators (ECE, ACE, HB) to be comparable to the results in their work. The kernel regression estimates for the visualized components are computed using a Gaussian kernel with bandwidth σ = 0.05; we discuss different choices of kernel and bandwidth in Appendix H. |
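The experiment setup quoted above (15-bin binned ECE, temperature scaling fit on a 20% calibration split) can be illustrated with a minimal sketch. This is not the paper's code: the binning here is equal-width over the max-class confidence, and the temperature is fit by a simple NLL grid search rather than the LBFGS optimization used in the Guo et al. (2017) implementation the paper follows.

```python
import numpy as np

def binned_ece(confidences, correct, n_bins=15):
    """Equal-width binned ECE: frequency-weighted mean |accuracy - confidence| per bin."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    n, ece = len(confidences), 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.sum() == 0:
            continue
        # Gap between empirical accuracy and mean confidence in this bin.
        ece += (mask.sum() / n) * abs(correct[mask].mean() - confidences[mask].mean())
    return ece

def softmax(logits, T=1.0):
    """Temperature-scaled softmax over the class dimension."""
    z = logits / T
    z = z - z.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def fit_temperature(logits, labels, grid=np.linspace(0.5, 5.0, 46)):
    """Pick the temperature minimizing NLL on the calibration split (grid search)."""
    nlls = []
    for T in grid:
        p = softmax(logits, T)
        nlls.append(-np.log(p[np.arange(len(labels)), labels] + 1e-12).mean())
    return grid[int(np.argmin(nlls))]
```

In the paper's setting, `fit_temperature` would be run on the 10,000-sample calibration split and `binned_ece` evaluated on the remaining 40,000 test samples, before and after scaling the logits by the fitted temperature.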