On Temperature Scaling and Conformal Prediction of Deep Classifiers

Authors: Lahav Dabah, Tom Tirer

ICML 2025

Reproducibility Variable Result LLM Response
Research Type: Experimental. "We start this paper with an extensive empirical study of the effect of temperature scaling (TS) on prominent CP methods. We show that while TS calibration improves the class-conditional coverage of adaptive CP methods, surprisingly, it negatively affects their prediction set sizes. Motivated by this behavior, we explore the effect of TS on CP beyond its calibration application and reveal an intriguing trend under which it allows trading off the prediction set size and conditional coverage of adaptive CP methods. Then, we establish a mathematical theory that explains the entire non-monotonic trend. Finally, based on our experiments and theory, we offer guidelines for practitioners to effectively combine adaptive CP with calibration, aligned with user-defined goals."
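The temperature scaling discussed in the quote above divides the network's logits by a scalar T before the softmax, flattening (T > 1) or sharpening (T < 1) the predicted distribution. A minimal pure-Python sketch (illustrative code, not taken from the paper's repository):

```python
import math

def softmax(logits, T=1.0):
    # Temperature-scaled softmax: divide logits by T before normalizing.
    # T = 1 recovers the standard softmax; T > 1 flattens the distribution.
    scaled = [z / T for z in logits]
    m = max(scaled)                      # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    s = sum(exps)
    return [e / s for e in exps]

logits = [2.0, 1.0, 0.1]
p1 = softmax(logits, T=1.0)  # uncalibrated probabilities
p2 = softmax(logits, T=2.0)  # flatter probabilities after scaling with T > 1
```

Adaptive CP methods build prediction sets from these (sorted, cumulated) probabilities, which is why changing T changes both coverage and set size.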
Researcher Affiliation: Academia. "Faculty of Engineering, Bar-Ilan University, Ramat Gan, Israel. Correspondence to: Lahav Dabah <EMAIL>."
Pseudocode: No. The paper describes methods and processes in narrative text and mathematical formulations but does not include any explicitly labeled "Pseudocode" or "Algorithm" blocks or figures.
Open Source Code: Yes. "Code for reproducing our experiments and applying our guidelines is available at https://github.com/lahavdabah/TS4CP."
Open Datasets: Yes. "Datasets. We conducted our experiments on CIFAR-10, CIFAR-100 (Krizhevsky et al., 2009) and ImageNet (Deng et al., 2009), chosen for their diverse content and varying levels of difficulty."
Dataset Splits: Yes. "TS calibration. For each dataset-model pair, we create a calibration set by randomly selecting 10% of the validation set. We obtain the calibration temperature T by optimizing the ECE objective. The optimal temperatures when using the NLL objective are very similar, as displayed in Table 3 in Appendix B.3.1. This justifies using ECE as the default for the experiments. CP Algorithms. For each of the dataset-model pairs, we construct the CP set (used for computing the thresholds of CP methods) by randomly selecting {5%, 10%, 20%} of the validation set, while ensuring not to include in the CP set samples that are used in the TS calibration."
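The key constraint in the quoted protocol is that the TS calibration set and the CP set are disjoint random subsets of the same validation set. A minimal sketch of such a split (the function name and fractions here are illustrative; the paper's repository may implement this differently):

```python
import random

def split_indices(n_val, ts_frac=0.10, cp_frac=0.10, seed=0):
    # Shuffle validation indices once, then carve out two
    # non-overlapping subsets: one for TS calibration, one for
    # computing the CP threshold.
    rng = random.Random(seed)
    idx = list(range(n_val))
    rng.shuffle(idx)
    n_ts = int(ts_frac * n_val)
    n_cp = int(cp_frac * n_val)
    ts_set = idx[:n_ts]
    cp_set = idx[n_ts:n_ts + n_cp]
    return ts_set, cp_set

# e.g. a 10,000-sample validation set with 10% for TS and 10% for CP
ts_set, cp_set = split_indices(10000, ts_frac=0.10, cp_frac=0.10)
```

Slicing a single shuffled index list guarantees the disjointness that the paper emphasizes ("ensuring not to include in the CP set samples that are used in the TS calibration").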
Hardware Specification: Yes. "We conducted our experiments using an NVIDIA GeForce GTX 1080 Ti."
Software Dependencies: No. The paper mentions using the "TORCHVISION.MODELS sub-package" and training with general parameters (such as the SGD optimizer and Cross-Entropy loss) but does not provide specific version numbers for software libraries or programming languages required for reproduction.
Experiment Setup: Yes. "For CIFAR-100 and CIFAR-10 models, we use: Batch size: 128; Epochs: 300; Cross-Entropy loss; Optimizer: SGD; Learning rate: 0.1; Momentum: 0.9; Weight decay: 0.0005."
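The reported hyperparameters correspond to a standard SGD-with-momentum update with L2 weight decay. A minimal sketch of a single parameter update under those settings (a scalar toy example, not the paper's training code):

```python
# Hyperparameters quoted from the experiment setup above.
config = {"batch_size": 128, "epochs": 300, "lr": 0.1,
          "momentum": 0.9, "weight_decay": 0.0005}

def sgd_step(w, grad, buf, lr, momentum, weight_decay):
    # Fold L2 weight decay into the gradient, update the momentum
    # buffer, then take the descent step (PyTorch-style SGD).
    g = grad + weight_decay * w
    buf = momentum * buf + g
    return w - lr * buf, buf

# One update on a scalar weight with a dummy gradient of 0.5.
w, buf = 1.0, 0.0
w, buf = sgd_step(w, 0.5, buf,
                  config["lr"], config["momentum"], config["weight_decay"])
```

With these values the first step gives g = 0.5005, so w moves from 1.0 to 1.0 - 0.1 * 0.5005 = 0.94995.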