On Temperature Scaling and Conformal Prediction of Deep Classifiers
Authors: Lahav Dabah, Tom Tirer
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We start this paper with an extensive empirical study of the effect of temperature scaling (TS) on prominent CP methods. We show that while TS calibration improves the class-conditional coverage of adaptive CP methods, surprisingly, it negatively affects their prediction set sizes. Motivated by this behavior, we explore the effect of TS on CP beyond its calibration application and reveal an intriguing trend under which it allows trading off the prediction set size and conditional coverage of adaptive CP methods. Then, we establish a mathematical theory that explains the entire non-monotonic trend. Finally, based on our experiments and theory, we offer guidelines for practitioners to effectively combine adaptive CP with calibration, aligned with user-defined goals. |
| Researcher Affiliation | Academia | Faculty of Engineering, Bar-Ilan University, Ramat Gan, Israel. Correspondence to: Lahav Dabah <EMAIL>. |
| Pseudocode | No | The paper describes methods and processes in narrative text and mathematical formulations but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks or figures. |
| Open Source Code | Yes | Code for reproducing our experiments and applying our guidelines is available at https://github.com/lahavdabah/TS4CP. |
| Open Datasets | Yes | Datasets. We conducted our experiments on CIFAR-10, CIFAR-100 (Krizhevsky et al., 2009) and ImageNet (Deng et al., 2009), chosen for their diverse content and varying levels of difficulty. |
| Dataset Splits | Yes | TS calibration. For each dataset-model pair, we create a calibration set by randomly selecting 10% of the validation set. We obtain the calibration temperature T by optimizing the ECE objective. The optimal temperatures when using the NLL objective are very similar, as displayed in Table 3 in the Appendix B.3.1. This justifies using ECE as the default for the experiments. CP Algorithms. For each of the dataset-model pairs, we construct the CP set (used for computing the thresholds of CP methods) by randomly selecting {5%, 10%, 20%} of the validation set, while ensuring not to include in the CP set samples that are used in the TS calibration. |
| Hardware Specification | Yes | We conducted our experiments using an NVIDIA GeForce GTX 1080 Ti. |
| Software Dependencies | No | The paper mentions using "TORCHVISION.MODELS sub-package" and training with general parameters (like SGD optimizer, Cross-Entropy loss) but does not provide specific version numbers for software libraries or programming languages required for reproduction. |
| Experiment Setup | Yes | For CIFAR-100 and CIFAR-10 models, we use: Batch size: 128; Epochs: 300; Cross-Entropy loss; Optimizer: SGD; Learning rate: 0.1; Momentum: 0.9; Weight decay: 0.0005. |
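The pipeline the table describes (fit a temperature on one held-out split, then compute a conformal threshold on a disjoint split) can be sketched as below. This is an illustrative NumPy sketch, not the authors' code (their implementation is at the linked TS4CP repository): the function names are hypothetical, the fit uses grid-search NLL rather than the paper's ECE objective (the paper notes both yield very similar temperatures), and the conformity score is an APS-style cumulative-probability score assumed as one representative adaptive CP method.

```python
import numpy as np

def softmax(logits, T=1.0):
    """Temperature-scaled softmax: divide logits by T before normalizing."""
    z = logits / T
    z = z - z.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def fit_temperature(logits, labels, grid=None):
    """Grid-search T minimizing NLL on the TS calibration split
    (the paper optimizes ECE but reports near-identical optimal T for NLL)."""
    if grid is None:
        grid = np.linspace(0.5, 5.0, 91)
    nlls = []
    for T in grid:
        p = softmax(logits, T)
        nlls.append(-np.log(p[np.arange(len(labels)), labels] + 1e-12).mean())
    return grid[int(np.argmin(nlls))]

def aps_threshold(probs, labels, alpha=0.1):
    """Conformal quantile of APS-style scores on the (disjoint) CP split:
    cumulative sorted-probability mass down to and including the true class."""
    n = len(labels)
    order = np.argsort(-probs, axis=1)                  # classes by descending prob
    cum = np.cumsum(np.take_along_axis(probs, order, axis=1), axis=1)
    rank = np.argmax(order == labels[:, None], axis=1)  # position of the true class
    scores = cum[np.arange(n), rank]
    k = min(int(np.ceil((n + 1) * (1 - alpha))) - 1, n - 1)
    return np.sort(scores)[k]

def aps_predict(probs, q):
    """Smallest prefix of classes (by descending prob) whose mass reaches q."""
    order = np.argsort(-probs, axis=1)
    cum = np.cumsum(np.take_along_axis(probs, order, axis=1), axis=1)
    return [order[i, : np.searchsorted(cum[i], q) + 1] for i in range(len(probs))]
```

In a real run, `logits` would come from one of the torchvision models on the validation set, with the 10% TS calibration split and the {5%, 10%, 20%} CP split drawn disjointly as the table specifies.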