Does confidence calibration improve conformal prediction?

Authors: Huajun Xi, Jianguo Huang, Kangdao Liu, Lei Feng, Hongxin Wei

TMLR 2025

Reproducibility Variable Result LLM Response
Research Type: Experimental. In this work, we make two key discoveries about the impact of confidence calibration methods on adaptive conformal prediction. Firstly, we empirically show that current confidence calibration methods (e.g., temperature scaling) typically lead to larger prediction sets with lower confidence in adaptive conformal prediction. Secondly, by investigating the role of the temperature value, we observe that high-confidence predictions produced by a low temperature lead to small prediction sets for adaptive conformal prediction. Theoretically, we prove that higher-confidence predictions with lower temperatures result in smaller prediction sets in expectation. This finding implies that the rescaling parameters in these calibration methods, when optimized with cross-entropy loss, might counteract the goal of generating small prediction sets. To address this issue, we propose Conformal Temperature Scaling (ConfTS), a variant of temperature scaling with a novel loss function designed to enhance the efficiency of prediction sets. This approach can be extended to optimize the parameters of other post-hoc confidence calibration methods. Extensive experiments demonstrate that our method improves existing adaptive conformal prediction methods in both image and text classification tasks.
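The abstract's central claim is that temperature affects the size of adaptive conformal prediction sets: sharper (low-temperature) probabilities reach the conformal threshold with fewer classes. The following is a minimal illustrative sketch of that mechanism using the APS score (randomization term omitted); the function names and synthetic logits are ours, not the paper's code.

```python
import numpy as np

def softmax(logits, T=1.0):
    """Temperature-scaled softmax; larger T flattens the distribution."""
    z = logits / T
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def aps_score(probs, label):
    """APS non-conformity score: cumulative probability mass of all classes
    ranked at or above the true label (randomization omitted for brevity)."""
    order = np.argsort(-probs)
    cum = np.cumsum(probs[order])
    rank = int(np.where(order == label)[0][0])
    return cum[rank]

def aps_prediction_set(probs, qhat):
    """Add classes in decreasing probability order until the cumulative
    mass first reaches the conformal threshold qhat."""
    order = np.argsort(-probs)
    cum = np.cumsum(probs[order])
    k = int(np.searchsorted(cum, qhat)) + 1
    return order[:min(k, len(probs))]

# Conformal calibration: qhat is the ceil((n+1)(1-alpha))/n empirical
# quantile of calibration-set scores; synthetic logits stand in for a model.
rng = np.random.default_rng(0)
alpha = 0.1
cal_logits = rng.normal(size=(500, 10))
cal_labels = rng.integers(0, 10, size=500)
scores = [aps_score(softmax(l), y) for l, y in zip(cal_logits, cal_labels)]
n = len(scores)
qhat = float(np.quantile(scores, np.ceil((n + 1) * (1 - alpha)) / n))

# Lower temperature -> sharper probabilities -> fewer classes needed to
# reach qhat, matching the paper's observation about small prediction sets.
test_logits = rng.normal(size=10)
small = aps_prediction_set(softmax(test_logits, T=0.5), qhat)
large = aps_prediction_set(softmax(test_logits, T=2.0), qhat)
```

Here the same threshold qhat is applied to both temperatures purely to isolate the effect of sharpness on set size; in a full pipeline the threshold would be recomputed per temperature.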
Researcher Affiliation: Academia. Huajun Xi (EMAIL), Department of Statistics and Data Science, Southern University of Science and Technology; Jianguo Huang (EMAIL), College of Computing and Data Science, Nanyang Technological University; Kangdao Liu (EMAIL), Department of Computer and Information Science, University of Macau; Lei Feng (EMAIL), Information Systems Technology and Design Pillar, Singapore University of Technology and Design; Hongxin Wei (EMAIL), Department of Statistics and Data Science, Southern University of Science and Technology.
Pseudocode: Yes. Appendix G, "Pseudo-algorithms of ConfTS, ConfPS and ConfVS": In this section, we present the pseudo-algorithms of the proposed methods, including ConfTS (Algorithm 1), ConfPS (Algorithm 2), and ConfVS (Algorithm 3). The essence of our method is to train a logits rescaling function with respect to the ConfTS loss. The loss function can be replaced by ConfTr or other loss functions designed for various targets.
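The quoted appendix describes training a logits rescaling function against the ConfTS loss. As a rough illustration of one plausible reading of that objective (a squared gap between the conformal quantile tau and the true-label scores, split across a validation and a conformal subset as described later in this report), here is a sketch that replaces gradient-based training with a grid search over the temperature; the function names and the exact form of the loss are our assumptions, not the paper's definitions.

```python
import numpy as np

def softmax(logits, T):
    z = logits / T
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def true_label_aps_scores(probs, labels):
    """APS score of the ground-truth label for every row (no randomization)."""
    order = np.argsort(-probs, axis=1)
    cum = np.cumsum(np.take_along_axis(probs, order, axis=1), axis=1)
    ranks = np.argmax(order == labels[:, None], axis=1)
    return cum[np.arange(len(labels)), ranks]

def confts_objective(T, val_logits, val_labels, conf_logits, conf_labels,
                     alpha=0.1):
    """Hypothetical ConfTS-style objective: squared gap between the conformal
    quantile tau (from the conformal subset) and the true-label scores on
    the validation subset, at temperature T."""
    s_conf = true_label_aps_scores(softmax(conf_logits, T), conf_labels)
    n = len(s_conf)
    q = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    tau = np.quantile(s_conf, q)
    s_val = true_label_aps_scores(softmax(val_logits, T), val_labels)
    return float(np.mean((tau - s_val) ** 2))

# Grid search stands in for the paper's gradient-based optimization.
rng = np.random.default_rng(1)
logits = rng.normal(size=(400, 10)) * 2.0
labels = rng.integers(0, 10, size=400)
val_l, val_y = logits[:200], labels[:200]
conf_l, conf_y = logits[200:], labels[200:]
grid = [0.5, 0.75, 1.0, 1.5, 2.0]
best_T = min(grid, key=lambda T: confts_objective(T, val_l, val_y,
                                                  conf_l, conf_y))
```

The paper optimizes a learnable temperature (and, for ConfPS/ConfVS, other rescaling parameters) with a differentiable loss; the grid search here only conveys the shape of the objective.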
Open Source Code: No. The paper does not provide an explicit statement about releasing code or a link to a code repository for the methodology described.
Open Datasets: Yes. Datasets. We evaluate ConfTS on both image and text classification tasks. For image classification, we employ CIFAR-100 (Krizhevsky et al., 2009), ImageNet (Deng et al., 2009), and ImageNet-V2 (Recht et al., 2019). For text classification, we utilize the AG News (Zhang et al., 2015) and DBpedia (Auer et al., 2007) datasets.
Dataset Splits: Yes. For ImageNet, we split the test dataset containing 50,000 images into 10,000 images for calibration and 40,000 for testing. For CIFAR-100 and ImageNet-V2, we split their test datasets, each containing 10,000 images, into 4,000 images for calibration and 6,000 for testing. For text datasets, we split each test dataset equally between calibration and testing. Additionally, we split the calibration set into two subsets of equal size: one subset is the validation set used to optimize the temperature value with ConfTS, while the other half is the conformal set for conformal calibration.
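The three-way split described above (test pool, then calibration halved into validation and conformal subsets) can be sketched as follows; the function name and the use of a seeded random permutation are our illustrative choices, with the ImageNet counts taken from the quoted text.

```python
import numpy as np

def make_splits(n_total, n_cal, seed=0):
    """Shuffle indices, carve off a calibration pool, then halve it into a
    validation set (for tuning the temperature with ConfTS) and a conformal
    set (for computing the conformal threshold)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_total)
    cal, test = idx[:n_cal], idx[n_cal:]
    half = n_cal // 2
    return cal[:half], cal[half:], test

# ImageNet numbers from the paper: 50,000 test images, 10,000 for calibration.
val_idx, conf_idx, test_idx = make_splits(50_000, 10_000)
```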
Hardware Specification: No. The paper mentions various models (ResNet18, ResNet50, ResNet101, DenseNet121, VGG16, ViT-B-16, BERT, GPT-Neo-1.3B) but does not specify the hardware (e.g., GPU/CPU models, memory) used for running the experiments.
Software Dependencies: No. The paper mentions 'TorchVision (Paszke et al., 2019)' and the 'AdamW optimizer' but does not provide specific version numbers for these or other software libraries and dependencies.
Experiment Setup: Yes. For each dataset, we employ the AdamW optimizer with a learning rate of 2e-5; training is conducted over 3 epochs with a batch size of 32. The models are trained for 100 epochs using SGD with a momentum of 0.9, a weight decay of 0.0005, and a batch size of 128; we set the initial learning rate to 0.1 and reduce it by a factor of 5 at 60 epochs. Conformal prediction algorithms. We leverage the adaptive conformal prediction methods APS and RAPS to generate prediction sets at error rate α ∈ {0.1, 0.05}. In addition, we set the regularization hyperparameters for RAPS to kreg = 1 and λ ∈ {0.001, 0.002, 0.004, 0.006, 0.01, 0.015, 0.02}.
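To make the quoted kreg and λ hyperparameters concrete, here is a minimal sketch of the RAPS non-conformity score they parameterize (randomization omitted); the function name is ours, and a full implementation would also apply the matching rule when forming prediction sets.

```python
import numpy as np

def raps_score(probs, label, lam=0.001, kreg=1):
    """RAPS non-conformity score: the APS cumulative mass of the true label
    plus a regularization penalty lam * max(0, rank - kreg), where rank is
    the label's 1-based position in the sorted probabilities."""
    order = np.argsort(-probs)
    cum = np.cumsum(probs[order])
    rank = int(np.where(order == label)[0][0]) + 1
    return float(cum[rank - 1] + lam * max(0, rank - kreg))

# With kreg = 1 (as in the paper), any label ranked below first place pays
# a penalty, which discourages large prediction sets.
probs = np.array([0.5, 0.3, 0.2])
s_top = raps_score(probs, label=0, lam=0.01, kreg=1)     # rank 1: no penalty
s_second = raps_score(probs, label=1, lam=0.01, kreg=1)  # rank 2: penalty 0.01
```

Larger λ penalizes deep ranks more heavily, which is why the paper sweeps λ over a grid while holding kreg fixed at 1.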