Extending Temperature Scaling with Homogenizing Maps

Authors: Christopher Qian, Feng Liang, Jason Adams

JMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type: Experimental. We demonstrate the advantage of our method over temperature scaling in both calibration and out-of-distribution detection. Additionally, we extend our methodology and experimental evaluation to recalibration in the Bayesian setting.
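Since the paper's contribution is evaluated against temperature scaling, a minimal sketch of that baseline may be useful context. The code below is a generic illustration on synthetic logits (the grid search, seed, and data are assumptions, not the authors' setup); temperature scaling fits a single scalar T on validation data and leaves argmax predictions unchanged:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def nll(logits, labels, T):
    # Negative log-likelihood of the temperature-scaled softmax.
    p = softmax(logits / T)
    return -np.mean(np.log(p[np.arange(len(labels)), labels] + 1e-12))

def fit_temperature(logits, labels, grid=np.linspace(0.5, 5.0, 91)):
    # Temperature scaling: choose the single scalar T > 0 minimizing
    # validation NLL. A grid search stands in for the usual LBFGS fit.
    return min(grid, key=lambda T: nll(logits, labels, T))

# Synthetic, deliberately overconfident logits (illustrative only).
rng = np.random.default_rng(0)
labels = rng.integers(0, 10, size=500)
logits = rng.normal(size=(500, 10))
logits[np.arange(500), labels] += 2.0  # give the true class a margin
logits *= 3.0                          # exaggerate confidence
T = fit_temperature(logits, labels)
```

Because the logits were artificially sharpened, the fitted temperature comes out above 1, softening the predicted probabilities without changing which class is predicted.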
Researcher Affiliation: Collaboration. Christopher Qian, Department of Statistics, University of Illinois Urbana-Champaign, Champaign, IL 61820, USA; Feng Liang, Department of Statistics, University of Illinois Urbana-Champaign, Champaign, IL 61820, USA; Jason Adams, Sandia National Laboratories, Albuquerque, NM 87123, USA.
Pseudocode: No. The paper describes methods in paragraph text and mathematical formulations but does not contain any clearly labeled pseudocode or algorithm blocks.
Open Source Code: No. The paper does not provide concrete access to source code for the described methodology. It mentions using a PyTorch implementation of the base models from a GitHub link, but this is third-party code, not the authors' own implementation of the proposed methods.
Open Datasets: Yes. We consider a neural network trained on CIFAR-100 (Krizhevsky, 2009). In the right part of Figure 1, we show the prediction on a Street View House Numbers (SVHN; Netzer et al., 2011) image; the model is 80% confident that the image is a tiger. ...we add 2000 observations from the classroom split of the LSUN (Yu et al., 2015) data set to the test data set.
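The OOD experiments quoted above pair an in-distribution test set (CIFAR-100) with OOD images (SVHN, LSUN). A common score in such evaluations is the maximum softmax probability; the sketch below is a generic illustration of that kind of OOD metric computation, not the paper's recalibration-based detector, and the example logits are synthetic:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def msp_score(logits):
    # Maximum softmax probability: higher means "more in-distribution".
    return softmax(logits).max(axis=-1)

def auroc(scores_in, scores_out):
    # AUROC for separating in-distribution (positive) from OOD (negative),
    # computed as the probability that a random ID score exceeds a random
    # OOD score, with ties counted as half.
    s_in = np.asarray(scores_in)[:, None]
    s_out = np.asarray(scores_out)[None, :]
    return (s_in > s_out).mean() + 0.5 * (s_in == s_out).mean()

# Synthetic example: confident ID logits vs. flat OOD logits.
logits_id = np.zeros((100, 10)); logits_id[:, 0] = 5.0
logits_ood = np.zeros((100, 10))
score = auroc(msp_score(logits_id), msp_score(logits_ood))
```

On this toy data the two score distributions are perfectly separated, so the AUROC is 1.0; real ID/OOD pairs overlap and land somewhere below that.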
Dataset Splits: Yes. We use the standard training split of CIFAR-100 to train five models. The standard test split of CIFAR-100 consists of 10,000 observations. We randomly sample 8,000 observations to create the validation data set D, which we use to learn the recalibration mappings for each method, and use the remaining 2,000 observations for testing. In addition, we randomly sample 2,000 observations from the OOD data set to compute the OOD detection metrics.
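The split protocol described above can be sketched as follows. The sizes (8,000 validation / 2,000 test / 2,000 OOD) follow the paper; the random seed, index handling, and OOD pool size are illustrative assumptions:

```python
import numpy as np

# Split the 10,000 CIFAR-100 test points into a validation set of 8,000
# (used to learn the recalibration mappings) and a held-out test set of
# 2,000, then sample 2,000 OOD indices without replacement.
rng = np.random.default_rng(0)  # seed is illustrative, not from the paper
perm = rng.permutation(10_000)
val_idx, test_idx = perm[:8_000], perm[8_000:]

n_ood_pool = 26_032  # size of the OOD pool; e.g. the SVHN test set
ood_idx = rng.choice(n_ood_pool, size=2_000, replace=False)
```

Shuffling once and slicing the permutation guarantees the validation and test sets are disjoint and together cover the full test split.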
Hardware Specification: No. The paper mentions that "This work made use of the Illinois Campus Cluster, a computing resource..." but does not specify any particular GPU or CPU models, processor types, or memory amounts used for the experiments.
Software Dependencies: No. The paper mentions using a PyTorch implementation but does not specify a version number for PyTorch or any other software dependency.
Experiment Setup: Yes. We train each model for 150 epochs using the default parameters from the implementation: SGD optimizer with learning rate 0.1, momentum 0.9, and weight decay 5e-4.
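The quoted hyperparameters correspond to the standard PyTorch-style SGD update (weight decay folded into the gradient, then momentum applied). As a sanity check of what that update rule does, here is a minimal NumPy re-implementation with the stated settings, applied to a toy quadratic objective; the toy problem and step count are illustrative, not the paper's training run:

```python
import numpy as np

def sgd_step(w, grad, buf, lr=0.1, momentum=0.9, weight_decay=5e-4):
    # PyTorch-style SGD update with the paper's reported hyperparameters:
    # g = grad + wd * w; buf = mu * buf + g; w = w - lr * buf.
    g = grad + weight_decay * w
    buf = momentum * buf + g
    w = w - lr * buf
    return w, buf

# Toy objective: 0.5 * ||w - target||^2, so the gradient is (w - target).
target = np.array([1.0, -2.0, 3.0])
w = np.zeros(3)
buf = np.zeros(3)
for _ in range(150):  # 150 steps, echoing the 150 training epochs
    grad = w - target
    w, buf = sgd_step(w, grad, buf)
```

With these settings the iterates spiral into the minimizer (momentum 0.9 gives lightly damped oscillations), ending close to `target`, offset slightly toward the origin by the weight-decay term.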