Optimizing Estimators of Squared Calibration Errors in Classification

Authors: Sebastian Gregor Gruber, Francis R. Bach

TMLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental We demonstrate the effectiveness of our pipeline by optimizing existing calibration estimators and comparing them with novel kernel ridge regression-based estimators on real-world image classification tasks. [...] Section 5 Experiments
Researcher Affiliation Academia Sebastian G. Gruber: German Cancer Consortium (DKTK), partner site Frankfurt/Mainz, a partnership between DKFZ and UCT Frankfurt-Marburg, Frankfurt am Main, Germany; German Cancer Research Center (DKFZ), Heidelberg, Germany; Goethe University Frankfurt, Germany. Francis Bach: Inria, École Normale Supérieure, PSL Research University, Paris, France.
Pseudocode Yes Algorithm 1 Evaluating the calibration of a given classifier and dataset by optimizing the calibration estimator. The evaluation dataset is split into a holdout set for estimating the calibration error, and another set, which is used for optimizing the calibration estimator via cross-validation.
Open Source Code Yes The source code is publicly available at https://github.com/SebGGruber/Optimizing_Calibration_Estimators.
Open Datasets Yes The image classification datasets in use are CIFAR10 with 10 classes, CIFAR100 with 100 classes (Krizhevsky, 2009), and ImageNet with 1,000 classes (Deng et al., 2009). [...] We train the Vision Transformer architecture (Dosovitskiy et al., 2020) on the MedMNIST datasets (Yang et al., 2021).
Dataset Splits Yes We run the calibration-evaluation pipeline proposed in Section 3.2.2 with a random split of the original test set, using 80% for tuning the calibration estimator function via cross-validation and 20% for the calibration test set D_te, which computes the mean in Equation (21). In all experiments, we use 5-fold cross-validation to optimize the hyperparameters of a calibration estimator function.
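The split described above (80% for tuning via 5-fold cross-validation, 20% held out for the final calibration estimate) can be sketched as follows. This is an illustrative reconstruction, not the authors' code; the function name and the index-striding fold assignment are assumptions.

```python
import random

def split_calibration_pipeline(n_samples, tune_frac=0.8, n_folds=5, seed=0):
    """Randomly split a test set of size n_samples into a tuning
    portion (for cross-validating the calibration estimator) and a
    holdout portion (for the final calibration estimate), then build
    cross-validation folds over the tuning portion."""
    rng = random.Random(seed)
    idx = list(range(n_samples))
    rng.shuffle(idx)
    n_tune = int(tune_frac * n_samples)
    tune_idx, test_idx = idx[:n_tune], idx[n_tune:]
    # Assign tuning indices round-robin to n_folds cross-validation folds
    folds = [tune_idx[k::n_folds] for k in range(n_folds)]
    return tune_idx, test_idx, folds
```

With n_samples = 10,000 (the size of the CIFAR test sets), this yields 8,000 tuning points split into five folds of 1,600 and a 2,000-point holdout set.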
Hardware Specification Yes All experiments are run on an Intel(R) Xeon(R) Gold 5218R at 2.1 GHz and a MacBook Pro M1.
Software Dependencies No The paper mentions using an 'implementation of the calibration estimator function h_kde given by the original authors (Popordanoska et al., 2022b)' and 'a pre-trained classifier from Huggingface and fine-tune with a modification of (Capelle, 2022)'. However, it does not provide specific version numbers for these or for other software libraries used in its own methodology.
Experiment Setup Yes As hyperparameter search spaces for the TCE experiments, we consider {5i | i = 1, …, 20} for the number of bins in h_bin, a bandwidth in {10^(−5((i−1)/14)(1−(i−1)/14)) | i = 1, …, 15} ∪ {0.2i | i = 1, …, 5} for the Dirichlet kernel of h_kde according to Popordanoska et al. (2022a), a regularization constant λ ∈ {n^0.5 · 10^(−2i+1) | i = 1, …, 9} for h_kkr, and λ ∈ {n^0.5 · 10^(−i) | i = 1, …, 9} for h_ukkr. For the CCE experiments, we consider the same set of bandwidths for the Dirichlet kernel of h_kde, a regularization constant λ ∈ {n^0.5 · 10^(−i+9) | i = 1, …, 18} for h_kkr, and λ ∈ {n^0.5 · 10^(−0.5i+4.5) | i = 1, …, 18} for h_ukkr.
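For concreteness, the TCE bin-count and regularization grids quoted above can be materialized as plain lists. This is a hedged sketch: the exponents are read off the search spaces as quoted, and n = 10,000 is a made-up example sample size, not a value fixed by the paper.

```python
import math

n = 10_000  # illustrative calibration-sample size (assumption)

# Number-of-bins grid for the binning estimator h_bin: 5, 10, ..., 100
bins_grid = [5 * i for i in range(1, 21)]

# Regularization grids for the kernel-ridge-regression-based
# estimators h_kkr and h_ukkr (TCE setting), scaled by sqrt(n)
lam_kkr = [math.sqrt(n) * 10.0 ** (-2 * i + 1) for i in range(1, 10)]
lam_ukkr = [math.sqrt(n) * 10.0 ** (-i) for i in range(1, 10)]
```

Each grid is then searched via the 5-fold cross-validation described in the Dataset Splits row.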