Extending Temperature Scaling with Homogenizing Maps

Authors: Christopher Qian, Feng Liang, Jason Adams

JMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type: Experimental. We demonstrate the advantage of our method over temperature scaling in both calibration and out-of-distribution detection. Additionally, we extend our methodology and experimental evaluation to recalibration in the Bayesian setting.
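Since the paper's contribution is evaluated against temperature scaling, a minimal sketch of that baseline may be useful context. The code below is a generic illustration on synthetic logits (the grid search, seed, and data are assumptions, not the authors' setup); temperature scaling fits a single scalar T on validation data and leaves argmax predictions unchanged:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def nll(logits, labels, T):
    # Negative log-likelihood of the temperature-scaled softmax.
    p = softmax(logits / T)
    return -np.mean(np.log(p[np.arange(len(labels)), labels] + 1e-12))

def fit_temperature(logits, labels, grid=np.linspace(0.5, 5.0, 91)):
    # Temperature scaling: choose the single scalar T > 0 minimizing
    # validation NLL. A grid search stands in for the usual LBFGS fit.
    return min(grid, key=lambda T: nll(logits, labels, T))

# Synthetic, deliberately overconfident logits (illustrative only).
rng = np.random.default_rng(0)
labels = rng.integers(0, 10, size=500)
logits = rng.normal(size=(500, 10))
logits[np.arange(500), labels] += 2.0  # give the true class a margin
logits *= 3.0                          # exaggerate confidence
T = fit_temperature(logits, labels)
```

Because the logits were artificially sharpened, the fitted temperature comes out above 1, softening the predicted probabilities without changing which class is predicted.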
Researcher Affiliation: Collaboration. Christopher Qian, Department of Statistics, University of Illinois Urbana-Champaign, Champaign, IL 61820, USA; Feng Liang, Department of Statistics, University of Illinois Urbana-Champaign, Champaign, IL 61820, USA; Jason Adams, Sandia National Laboratories, Albuquerque, NM 87123, USA.
Pseudocode: No. The paper describes methods in paragraph text and mathematical formulations but does not contain any clearly labeled pseudocode or algorithm blocks.
Open Source Code: No. The paper does not provide concrete access to source code for the described methodology. It mentions using a PyTorch implementation of the base models from a GitHub link, but this is third-party code, not the authors' own implementation of the proposed methods.
Open Datasets: Yes. We consider a neural network trained on CIFAR-100 (Krizhevsky, 2009). In the right part of Figure 1, we show the prediction on a Street View House Numbers (SVHN; Netzer et al., 2011) image; the model is 80% confident that the image is a tiger. ...we add 2000 observations from the classroom split of the LSUN (Yu et al., 2015) data set to the test data set.
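The OOD experiments quoted above pair an in-distribution test set (CIFAR-100) with OOD images (SVHN, LSUN). A common score in such evaluations is the maximum softmax probability; the sketch below is a generic illustration of that kind of OOD metric computation, not the paper's recalibration-based detector, and the example logits are synthetic:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def msp_score(logits):
    # Maximum softmax probability: higher means "more in-distribution".
    return softmax(logits).max(axis=-1)

def auroc(scores_in, scores_out):
    # AUROC for separating in-distribution (positive) from OOD (negative),
    # computed as the probability that a random ID score exceeds a random
    # OOD score, with ties counted as half.
    s_in = np.asarray(scores_in)[:, None]
    s_out = np.asarray(scores_out)[None, :]
    return (s_in > s_out).mean() + 0.5 * (s_in == s_out).mean()

# Synthetic example: confident ID logits vs. flat OOD logits.
logits_id = np.zeros((100, 10)); logits_id[:, 0] = 5.0
logits_ood = np.zeros((100, 10))
score = auroc(msp_score(logits_id), msp_score(logits_ood))
```

On this toy data the two score distributions are perfectly separated, so the AUROC is 1.0; real ID/OOD pairs overlap and land somewhere below that.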
Dataset Splits: Yes. We use the standard training split of CIFAR-100 to train five models. The standard test split of CIFAR-100 consists of 10,000 observations. We randomly sample 8,000 observations to create the validation data set D, which we use to learn the recalibration mappings for each method, and use the remaining 2,000 observations for testing. In addition, we randomly sample 2,000 observations from the OOD data set to compute the OOD detection metrics.
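The split protocol described above can be sketched as follows. The sizes (8,000 validation / 2,000 test / 2,000 OOD) follow the paper; the random seed, index handling, and OOD pool size are illustrative assumptions:

```python
import numpy as np

# Split the 10,000 CIFAR-100 test points into a validation set of 8,000
# (used to learn the recalibration mappings) and a held-out test set of
# 2,000, then sample 2,000 OOD indices without replacement.
rng = np.random.default_rng(0)  # seed is illustrative, not from the paper
perm = rng.permutation(10_000)
val_idx, test_idx = perm[:8_000], perm[8_000:]

n_ood_pool = 26_032  # size of the OOD pool; e.g. the SVHN test set
ood_idx = rng.choice(n_ood_pool, size=2_000, replace=False)
```

Shuffling once and slicing the permutation guarantees the validation and test sets are disjoint and together cover the full test split.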
Hardware Specification: No. The paper mentions that "This work made use of the Illinois Campus Cluster, a computing resource..." but does not specify any particular GPU or CPU models, processor types, or memory amounts used for the experiments.
Software Dependencies: No. The paper mentions using a PyTorch implementation but does not specify a version number for PyTorch or any other software dependency.
Experiment Setup: Yes. We train each model for 150 epochs using the default parameters from the implementation: SGD optimizer with learning rate 0.1, momentum 0.9, and weight decay 5e-4.
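The quoted hyperparameters correspond to the standard PyTorch-style SGD update (weight decay folded into the gradient, then momentum applied). As a sanity check of what that update rule does, here is a minimal NumPy re-implementation with the stated settings, applied to a toy quadratic objective; the toy problem and step count are illustrative, not the paper's training run:

```python
import numpy as np

def sgd_step(w, grad, buf, lr=0.1, momentum=0.9, weight_decay=5e-4):
    # PyTorch-style SGD update with the paper's reported hyperparameters:
    # g = grad + wd * w; buf = mu * buf + g; w = w - lr * buf.
    g = grad + weight_decay * w
    buf = momentum * buf + g
    w = w - lr * buf
    return w, buf

# Toy objective: 0.5 * ||w - target||^2, so the gradient is (w - target).
target = np.array([1.0, -2.0, 3.0])
w = np.zeros(3)
buf = np.zeros(3)
for _ in range(150):  # 150 steps, echoing the 150 training epochs
    grad = w - target
    w, buf = sgd_step(w, grad, buf)
```

With these settings the iterates spiral into the minimizer (momentum 0.9 gives lightly damped oscillations), ending close to `target`, offset slightly toward the origin by the weight-decay term.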