Optimizing Estimators of Squared Calibration Errors in Classification

Authors: Sebastian Gregor Gruber, Francis R. Bach

TMLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental We demonstrate the effectiveness of our pipeline by optimizing existing calibration estimators and comparing them with novel kernel ridge regression-based estimators on real-world image classification tasks. [...] Section 5 Experiments
Researcher Affiliation Academia Sebastian G. Gruber: German Cancer Consortium (DKTK), partner site Frankfurt/Mainz, a partnership between DKFZ and UCT Frankfurt-Marburg, Frankfurt am Main, Germany; German Cancer Research Center (DKFZ), Heidelberg, Germany; Goethe University Frankfurt, Germany. Francis Bach: Inria, École Normale Supérieure, PSL Research University, Paris, France.
Pseudocode Yes Algorithm 1 Evaluating the calibration of a given classifier and dataset by optimizing the calibration estimator. The evaluation dataset is split into a holdout set for estimating the calibration error, and another set, which is used for optimizing the calibration estimator via cross-validation.
Open Source Code Yes The source code is publicly available at https://github.com/SebGGruber/Optimizing_Calibration_Estimators.
Open Datasets Yes The image classification datasets in use are CIFAR10 with 10 classes, CIFAR100 with 100 classes (Krizhevsky, 2009), and ImageNet with 1,000 classes (Deng et al., 2009). [...] We train the Vision Transformer architecture (Dosovitskiy et al., 2020) on the MedMNIST datasets (Yang et al., 2021).
Dataset Splits Yes We run the calibration-evaluation pipeline proposed in Section 3.2.2 with a random split of the original test set, using 80% for tuning the calibration estimator function via cross-validation and 20% for the calibration test set D_te, which computes the mean in Equation (21). In all experiments, we use 5-fold cross-validation to optimize the hyperparameters of a calibration estimator function.
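The split described above (80% for tuning via 5-fold cross-validation, 20% held out for the final calibration estimate) can be sketched as follows. This is an illustrative reconstruction, not the authors' code; the function name and the index-striding fold assignment are assumptions.

```python
import random

def split_calibration_pipeline(n_samples, tune_frac=0.8, n_folds=5, seed=0):
    """Randomly split a test set of size n_samples into a tuning
    portion (for cross-validating the calibration estimator) and a
    holdout portion (for the final calibration estimate), then build
    cross-validation folds over the tuning portion."""
    rng = random.Random(seed)
    idx = list(range(n_samples))
    rng.shuffle(idx)
    n_tune = int(tune_frac * n_samples)
    tune_idx, test_idx = idx[:n_tune], idx[n_tune:]
    # Assign tuning indices round-robin to n_folds cross-validation folds
    folds = [tune_idx[k::n_folds] for k in range(n_folds)]
    return tune_idx, test_idx, folds
```

With n_samples = 10,000 (the size of the CIFAR test sets), this yields 8,000 tuning points split into five folds of 1,600 and a 2,000-point holdout set.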
Hardware Specification Yes All experiments are run on an Intel(R) Xeon(R) Gold 5218R at 2.1 GHz and a MacBook Pro M1.
Software Dependencies No The paper mentions using an 'implementation of the calibration estimator function h_kde given by the original authors (Popordanoska et al., 2022b)' and 'a pre-trained classifier from Huggingface and fine-tune with a modification of (Capelle, 2022)'. However, it does not provide specific version numbers for these or for other software libraries used in its own methodology.
Experiment Setup Yes As hyperparameter search spaces for the TCE experiments, we consider {5i | i = 1, …, 20} for the number of bins in h_bin, a bandwidth in {10^(−5((i−1)/14)(1−(i−1)/14)) | i = 1, …, 15} ∪ {0.2i | i = 1, …, 5} for the Dirichlet kernel of h_kde according to Popordanoska et al. (2022a), a regularization constant λ ∈ {n^0.5 · 10^(−2i+1) | i = 1, …, 9} for h_kkr, and λ ∈ {n^0.5 · 10^(−i) | i = 1, …, 9} for h_ukkr. For the CCE experiments, we consider the same set of bandwidths for the Dirichlet kernel of h_kde, a regularization constant λ ∈ {n^0.5 · 10^(−i+9) | i = 1, …, 18} for h_kkr, and λ ∈ {n^0.5 · 10^(−0.5i+4.5) | i = 1, …, 18} for h_ukkr.
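For concreteness, the TCE bin-count and regularization grids quoted above can be materialized as plain lists. This is a hedged sketch: the exponents are read off the search spaces as quoted, and n = 10,000 is a made-up example sample size, not a value fixed by the paper.

```python
import math

n = 10_000  # illustrative calibration-sample size (assumption)

# Number-of-bins grid for the binning estimator h_bin: 5, 10, ..., 100
bins_grid = [5 * i for i in range(1, 21)]

# Regularization grids for the kernel-ridge-regression-based
# estimators h_kkr and h_ukkr (TCE setting), scaled by sqrt(n)
lam_kkr = [math.sqrt(n) * 10.0 ** (-2 * i + 1) for i in range(1, 10)]
lam_ukkr = [math.sqrt(n) * 10.0 ** (-i) for i in range(1, 10)]
```

Each grid is then searched via the 5-fold cross-validation described in the Dataset Splits row.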