On Temperature Scaling and Conformal Prediction of Deep Classifiers

Authors: Lahav Dabah, Tom Tirer

ICML 2025

Reproducibility Variable Result LLM Response
Research Type: Experimental. "We start this paper with an extensive empirical study of the effect of temperature scaling (TS) on prominent CP methods. We show that while TS calibration improves the class-conditional coverage of adaptive CP methods, surprisingly, it negatively affects their prediction set sizes. Motivated by this behavior, we explore the effect of TS on CP beyond its calibration application and reveal an intriguing trend under which it allows trading off the prediction set size and conditional coverage of adaptive CP methods. Then, we establish a mathematical theory that explains the entire non-monotonic trend. Finally, based on our experiments and theory, we offer guidelines for practitioners to effectively combine adaptive CP with calibration, aligned with user-defined goals."
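The temperature scaling discussed in the quote above divides the network's logits by a scalar T before the softmax, flattening (T > 1) or sharpening (T < 1) the predicted distribution. A minimal pure-Python sketch (illustrative code, not taken from the paper's repository):

```python
import math

def softmax(logits, T=1.0):
    # Temperature-scaled softmax: divide logits by T before normalizing.
    # T = 1 recovers the standard softmax; T > 1 flattens the distribution.
    scaled = [z / T for z in logits]
    m = max(scaled)                      # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    s = sum(exps)
    return [e / s for e in exps]

logits = [2.0, 1.0, 0.1]
p1 = softmax(logits, T=1.0)  # uncalibrated probabilities
p2 = softmax(logits, T=2.0)  # flatter probabilities after scaling with T > 1
```

Adaptive CP methods build prediction sets from these (sorted, cumulated) probabilities, which is why changing T changes both coverage and set size.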
Researcher Affiliation: Academia. "Faculty of Engineering, Bar-Ilan University, Ramat Gan, Israel. Correspondence to: Lahav Dabah <EMAIL>."
Pseudocode: No. The paper describes methods and processes in narrative text and mathematical formulations but does not include any explicitly labeled "Pseudocode" or "Algorithm" blocks or figures.
Open Source Code: Yes. "Code for reproducing our experiments and applying our guidelines is available at https://github.com/lahavdabah/TS4CP."
Open Datasets: Yes. "Datasets. We conducted our experiments on CIFAR-10, CIFAR-100 (Krizhevsky et al., 2009) and ImageNet (Deng et al., 2009), chosen for their diverse content and varying levels of difficulty."
Dataset Splits: Yes. "TS calibration. For each dataset-model pair, we create a calibration set by randomly selecting 10% of the validation set. We obtain the calibration temperature T by optimizing the ECE objective. The optimal temperatures when using the NLL objective are very similar, as displayed in Table 3 in Appendix B.3.1. This justifies using ECE as the default for the experiments. CP Algorithms. For each of the dataset-model pairs, we construct the CP set (used for computing the thresholds of CP methods) by randomly selecting {5%, 10%, 20%} of the validation set, while ensuring not to include in the CP set samples that are used in the TS calibration."
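The key constraint in the quoted protocol is that the TS calibration set and the CP set are disjoint random subsets of the same validation set. A minimal sketch of such a split (the function name and fractions here are illustrative; the paper's repository may implement this differently):

```python
import random

def split_indices(n_val, ts_frac=0.10, cp_frac=0.10, seed=0):
    # Shuffle validation indices once, then carve out two
    # non-overlapping subsets: one for TS calibration, one for
    # computing the CP threshold.
    rng = random.Random(seed)
    idx = list(range(n_val))
    rng.shuffle(idx)
    n_ts = int(ts_frac * n_val)
    n_cp = int(cp_frac * n_val)
    ts_set = idx[:n_ts]
    cp_set = idx[n_ts:n_ts + n_cp]
    return ts_set, cp_set

# e.g. a 10,000-sample validation set with 10% for TS and 10% for CP
ts_set, cp_set = split_indices(10000, ts_frac=0.10, cp_frac=0.10)
```

Slicing a single shuffled index list guarantees the disjointness that the paper emphasizes ("ensuring not to include in the CP set samples that are used in the TS calibration").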
Hardware Specification: Yes. "We conducted our experiments using an NVIDIA GeForce GTX 1080 Ti."
Software Dependencies: No. The paper mentions using the "TORCHVISION.MODELS sub-package" and training with general parameters (such as the SGD optimizer and Cross-Entropy loss) but does not provide specific version numbers for software libraries or programming languages required for reproduction.
Experiment Setup: Yes. "For CIFAR-100 and CIFAR-10 models, we use: Batch size: 128; Epochs: 300; Cross-Entropy loss; Optimizer: SGD; Learning rate: 0.1; Momentum: 0.9; Weight decay: 0.0005."
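The reported hyperparameters correspond to a standard SGD-with-momentum update with L2 weight decay. A minimal sketch of a single parameter update under those settings (a scalar toy example, not the paper's training code):

```python
# Hyperparameters quoted from the experiment setup above.
config = {"batch_size": 128, "epochs": 300, "lr": 0.1,
          "momentum": 0.9, "weight_decay": 0.0005}

def sgd_step(w, grad, buf, lr, momentum, weight_decay):
    # Fold L2 weight decay into the gradient, update the momentum
    # buffer, then take the descent step (PyTorch-style SGD).
    g = grad + weight_decay * w
    buf = momentum * buf + g
    return w - lr * buf, buf

# One update on a scalar weight with a dummy gradient of 0.5.
w, buf = 1.0, 0.0
w, buf = sgd_step(w, 0.5, buf,
                  config["lr"], config["momentum"], config["weight_decay"])
```

With these values the first step gives g = 0.5005, so w moves from 1.0 to 1.0 - 0.1 * 0.5005 = 0.94995.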