Reassessing How to Compare and Improve the Calibration of Machine Learning Models
Authors: Muthu Chidambaram, Rong Ge
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments. To drive home the point, we revisit the experiments of Guo et al. (2017), but focus on one of the more modern architectural choices of Minderer et al. (2021). Namely, we evaluate the most downloaded pretrained vision transformer (ViT) model (Dosovitskiy et al., 2021; Steiner et al., 2021) available through the timm library (Wightman, 2019) with respect to binned ECE, binned ACE (Nixon et al., 2019), Smooth ECE (Błasiok & Nakkiran, 2023), negative log-likelihood, and MSE on ImageNet-1K-Val (Russakovsky et al., 2015). We split the data into 10,000 calibration samples (20% split) and 40,000 test samples, and compare the unmodified test performance to temperature scaling (TS), histogram binning (HB), isotonic regression (IR), and our proposed MRR. We follow the same TS implementation as Guo et al. (2017) and also use 15 bins for the binning estimators (ECE, ACE, HB) to be comparable to the results in their work. The results shown in Table 1 are not sensitive to the choice of model; we show similar results for several other popular timm models in Appendix F. |
| Researcher Affiliation | Academia | Muthu Chidambaram & Rong Ge Department of Computer Science, Duke University |
| Pseudocode | No | The paper describes algorithms and methods verbally and mathematically but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks, nor structured steps formatted like code. |
| Open Source Code | Yes | Code for our visualizations is available as the Python package sharpcal. |
| Open Datasets | Yes | Namely, we evaluate the most downloaded pretrained vision transformer (ViT) model (Dosovitskiy et al., 2021; Steiner et al., 2021) available through the timm library (Wightman, 2019) with respect to binned ECE, binned ACE (Nixon et al., 2019), Smooth ECE (Błasiok & Nakkiran, 2023), negative log-likelihood, and MSE on ImageNet-1K-Val (Russakovsky et al., 2015). F.2 CIFAR EXPERIMENTS We used ImageNet in the main paper due to the plethora of high quality, pretrained models available via open source libraries. To show that our observations extend to other datasets, however, we also include experiments on CIFAR-10 and CIFAR-100. |
| Dataset Splits | Yes | We split the data into 10,000 calibration samples (20% split) and 40,000 test samples, and compare the unmodified test performance to temperature scaling (TS), histogram binning (HB), isotonic regression (IR), and our proposed MRR. ...calibrate them using 20% of the CIFAR-10 and CIFAR-100 test sets to be consistent with our ImageNet experiments. |
| Hardware Specification | Yes | All of our experiments were done in PyTorch (Paszke et al., 2019) on a single A5000 GPU. |
| Software Dependencies | No | The paper mentions 'PyTorch (Paszke et al., 2019)' and the 'timm library', as well as a 'Python package sharpcal', but does not provide specific version numbers for any of these software components. |
| Experiment Setup | Yes | We split the data into 10,000 calibration samples (20% split) and 40,000 test samples, and compare the unmodified test performance to temperature scaling (TS), histogram binning (HB), isotonic regression (IR), and our proposed MRR. We follow the same TS implementation as Guo et al. (2017) and also use 15 bins for the binning estimators (ECE, ACE, HB) to be comparable to the results in their work. The kernel regression estimates for the visualized components are computed using a Gaussian kernel with bandwidth σ = 0.05; we discuss different choices of kernel and bandwidth in Appendix H. |
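The experiment setup quoted above (15-bin binned ECE, temperature scaling fit on a 20% calibration split) can be illustrated with a minimal sketch. This is not the paper's code: the binning here is equal-width over the max-class confidence, and the temperature is fit by a simple NLL grid search rather than the LBFGS optimization used in the Guo et al. (2017) implementation the paper follows.

```python
import numpy as np

def binned_ece(confidences, correct, n_bins=15):
    """Equal-width binned ECE: frequency-weighted mean |accuracy - confidence| per bin."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    n, ece = len(confidences), 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.sum() == 0:
            continue
        # Gap between empirical accuracy and mean confidence in this bin.
        ece += (mask.sum() / n) * abs(correct[mask].mean() - confidences[mask].mean())
    return ece

def softmax(logits, T=1.0):
    """Temperature-scaled softmax over the class dimension."""
    z = logits / T
    z = z - z.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def fit_temperature(logits, labels, grid=np.linspace(0.5, 5.0, 46)):
    """Pick the temperature minimizing NLL on the calibration split (grid search)."""
    nlls = []
    for T in grid:
        p = softmax(logits, T)
        nlls.append(-np.log(p[np.arange(len(labels)), labels] + 1e-12).mean())
    return grid[int(np.argmin(nlls))]
```

In the paper's setting, `fit_temperature` would be run on the 10,000-sample calibration split and `binned_ece` evaluated on the remaining 40,000 test samples, before and after scaling the logits by the fitted temperature.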