T-Cal: An Optimal Test for the Calibration of Predictive Models

Authors: Donghwan Lee, Xinmeng Huang, Hamed Hassani, Edgar Dobriban

JMLR 2023

Reproducibility Variable Result LLM Response
Research Type: Experimental. "We verify our theoretical findings with a broad range of experiments, including with several popular deep neural net architectures and several standard post-hoc calibration methods. T-Cal is a practical general-purpose tool, which combined with classical tests for discrete-valued predictors can be used to test the calibration of virtually any probabilistic classification method. ... We support our theoretical results with a broad range of experiments. We provide simulations, which support our theoretical optimality results. We also provide experiments with several popular deep neural net architectures (ResNet-50, VGG-19, DenseNet-121, etc.), on benchmark datasets (CIFAR-10 and 100, ImageNet) and several standard post-hoc calibration methods (Platt scaling, histogram binning, isotonic regression, etc.)."
Researcher Affiliation: Academia. Donghwan Lee (EMAIL), Graduate Group in Applied Mathematics and Computational Science, University of Pennsylvania, Philadelphia, PA 19104-6340, USA; Xinmeng Huang (EMAIL), Graduate Group in Applied Mathematics and Computational Science, University of Pennsylvania, Philadelphia, PA 19104-6340, USA; Hamed Hassani (EMAIL), Department of Electrical and Systems Engineering, University of Pennsylvania, Philadelphia, PA 19104-6340, USA; Edgar Dobriban (EMAIL), Department of Statistics and Data Science, University of Pennsylvania, Philadelphia, PA 19104-6340, USA
Pseudocode: Yes. Algorithm 1: T-Cal, an optimal test for calibration (based on debiased plug-in estimation of the calibration error); Algorithm 2: Adaptive T-Cal, an adaptive test for calibration; Algorithm 3: sample splitting calibration test ξ_n^split.
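The quoted Algorithm 1 rests on debiased plug-in estimation of the calibration error. A minimal sketch of what a debiased binned ℓ2 estimator can look like: the naive per-bin squared gap between mean label and mean prediction is biased upward, and subtracting a within-bin variance term removes that bias. The function name, fixed-width binning, and the exact variance correction are illustrative assumptions, not the paper's code.

```python
import numpy as np

def debiased_binned_ece2(z, y, num_bins=10):
    """Sketch of a debiased plug-in estimate of the squared (l2) binned
    calibration error. Per bin: (mean(y) - mean(z))^2 minus an estimate
    of its sampling variance, so a perfectly calibrated model gives a
    value near zero instead of a small positive bias."""
    z, y = np.asarray(z, float), np.asarray(y, float)
    n = len(z)
    edges = np.linspace(0.0, 1.0, num_bins + 1)
    # Assign each prediction to a fixed-width bin in [0, 1].
    idx = np.clip(np.searchsorted(edges, z, side="right") - 1, 0, num_bins - 1)
    est = 0.0
    for b in range(num_bins):
        mask = idx == b
        n_b = mask.sum()
        if n_b < 2:
            continue
        gap = y[mask].mean() - z[mask].mean()
        # Unbiased within-bin variance of the gap, used as the debiasing term.
        var_b = np.var(y[mask] - z[mask], ddof=1) / n_b
        est += (n_b / n) * (gap * gap - var_b)
    return est
```

On simulated calibrated data (y drawn as Bernoulli(z)) the statistic concentrates near zero, while a systematic shift in the labels pushes it clearly positive, which is what a calibration test thresholds on.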
Open Source Code: Yes. "T-Cal is available at https://github.com/dh7401/T-Cal. Our numerical results can be reproduced with code available at https://github.com/dh7401/T-Cal."
Open Datasets: Yes. "We also provide experiments with several popular deep neural net architectures (ResNet-50, VGG-19, DenseNet-121, etc.), on benchmark datasets (CIFAR-10 and 100, ImageNet) and several standard post-hoc calibration methods (Platt scaling, histogram binning, isotonic regression, etc.)."
Dataset Splits: Yes. "To this end, we split the original dataset of 10,000 images into 2 sets of sizes 2,000 and 8,000. The first set is used to calibrate the model, and the second is used to perform adaptive T-Cal and calculate the empirical ℓ1-ECE. ... The test set provided by CIFAR-100 is split into two parts, containing 2,000 and 8,000 images, respectively. ... We split the validation set of 50,000 images into a calibration set and a test set of sizes 10,000 and 40,000, respectively."
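The quoted splits are plain index partitions (e.g. 2,000 calibration / 8,000 test for CIFAR-100). A seeded sketch of such a split, for anyone reproducing the setup; the uniform shuffle and the seed value are assumptions, and the paper's own splitting code may differ:

```python
import numpy as np

def calibration_test_split(n_total, n_cal, seed=0):
    """Disjoint index split into a calibration set and a test set.
    Sizes follow the quote above; the seeded permutation is an
    illustrative assumption, not the authors' procedure."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_total)
    return perm[:n_cal], perm[n_cal:]
```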
Hardware Specification: No. The paper does not describe the hardware (GPU models, CPU models, or memory) used to run its experiments. It names neural network architectures and datasets, but not the underlying computational resources.
Software Dependencies: No. The paper mentions PyTorch and the torchvision package but gives no version numbers for these or any other software dependencies, which a reproducible description requires.
Experiment Setup: Yes. "In polynomial scaling, we use polynomials of order 3 to do regression on all the prediction-label pairs (Z_i, Y_i), and truncate the calibrated prediction values into the interval [0, 1]. We set the binning scheme in both histogram binning and scaling binning as 15 equal-mass bins. ... we set the polynomial degree as five in polynomial scaling."
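The two recalibration ingredients quoted above, equal-mass binning and polynomial scaling with truncation to [0, 1], can be sketched as follows. The function names and the least-squares fit via `numpy.polyfit` are illustrative assumptions; only the bin count, polynomial degree, and truncation come from the quote.

```python
import numpy as np

def equal_mass_bins(scores, num_bins=15):
    """Equal-mass (equal-frequency) bin edges: quantiles of the score
    distribution, so each bin holds roughly the same number of points.
    Used for histogram binning and scaling binning in the quote above."""
    qs = np.linspace(0.0, 1.0, num_bins + 1)
    edges = np.quantile(np.asarray(scores, float), qs)
    edges[0], edges[-1] = 0.0, 1.0  # cover the full probability range
    return edges

def polynomial_scaling(z_fit, y_fit, z_new, degree=3):
    """Sketch of polynomial scaling: least-squares polynomial regression
    of labels on predictions, with recalibrated outputs truncated
    (clipped) to the interval [0, 1] as described in the quote."""
    coeffs = np.polyfit(np.asarray(z_fit, float), np.asarray(y_fit, float), deg=degree)
    return np.clip(np.polyval(coeffs, np.asarray(z_new, float)), 0.0, 1.0)
```

With 15 equal-mass bins on 1,500 scores, each bin holds about 100 points regardless of how the scores are distributed, which is the point of equal-mass rather than equal-width binning.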