Understanding and Mitigating Miscalibration in Prompt Tuning for Vision-Language Models

Authors: Shuoyuan Wang, Yixuan Li, Hongxin Wei

ICML 2025

Reproducibility assessment (Variable — Result — LLM Response):
Research Type — Experimental. Extensive experiments demonstrate that DOR can notably enhance the calibration performance of current fine-tuning methods. Our code is available at https://github.com/ml-stat-Sustech/Outlier-Calibration. We verify the effectiveness of DOR across 11 image classification datasets and 4 types of ImageNets with covariate shifts. Extensive experiments show that DOR can enhance the calibration of existing prompt-tuning methods (see Table 1).
Researcher Affiliation — Academia. 1Department of Statistics and Data Science, Southern University of Science and Technology, Shenzhen, China. 2Department of Computer Sciences, University of Wisconsin-Madison, WI, USA. Correspondence to: Hongxin Wei <EMAIL>.
Pseudocode — No. The paper describes the method "Dynamic Outlier Regularization (DOR)" in natural language in Section 4 ("Method: Dynamic Outlier Regularization") but does not provide a structured pseudocode or algorithm block.
Open Source Code — Yes. Our code is available at https://github.com/ml-stat-Sustech/Outlier-Calibration.
Open Datasets — Yes. For the base-to-new evaluation, we cover diverse classification tasks including ImageNet (Deng et al., 2009), Caltech101 (Fei-Fei et al., 2004), Oxford Pets (Parkhi et al., 2012), Stanford Cars (Krause et al., 2013), Flowers102 (Nilsback & Zisserman, 2008), Food101 (Bossard et al., 2014), FGVCAircraft (Maji et al., 2013), SUN397 (Xiao et al., 2010), UCF101 (Soomro et al., 2012), DTD (Cimpoi et al., 2014) and EuroSAT (Helber et al., 2019).
Dataset Splits — Yes. A downstream dataset will be equally split into base and new classes. The model is trained only on the base classes in a few-shot setting and evaluated on both base and new classes. We fine-tune the model with 16 samples per class in a few-shot setting (Zhou et al., 2022a).
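The base-to-new split described above can be sketched as follows. This is a minimal illustration of the protocol, not the authors' code; the function names, the deterministic half-and-half split, and the seeding scheme are assumptions.

```python
import random


def split_base_new(class_names):
    """Equally split a dataset's classes into base and new halves,
    as in the base-to-new protocol (Zhou et al., 2022a).
    If the class count is odd, the base half gets the extra class."""
    mid = (len(class_names) + 1) // 2
    return class_names[:mid], class_names[mid:]


def sample_few_shot(class_to_indices, base_classes, shots=16, seed=0):
    """Draw a few-shot training subset (16 samples per class in the paper)
    from the base classes only; new classes are held out for evaluation."""
    rng = random.Random(seed)
    return {
        c: rng.sample(class_to_indices[c], min(shots, len(class_to_indices[c])))
        for c in base_classes
    }
```

Evaluation then runs on both halves: accuracy/calibration on the base classes the model was tuned on, and on the unseen new classes.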
Hardware Specification — No. The paper does not explicitly state the specific hardware (GPU/CPU models, memory, etc.) used for running the experiments. It only mentions the use of CLIP (ViT-B/16) as the pre-trained VLM.
Software Dependencies — No. The paper states that implementations are based on an open-source repository and mentions specific models and tuning methods, but it does not provide version numbers for software dependencies (e.g., Python, PyTorch, CUDA).
Experiment Setup — Yes. We fine-tune the model with 16 samples per class in a few-shot setting (Zhou et al., 2022a). We list the general hyperparameters in Table 9. For the hyperparameter λ in DOR, we set λ = 8.0 for CoOp, λ = 4.0 for MaPLe, and λ = 2.0 for other fine-tuning methods.
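The reported λ settings can be captured in a small lookup table. This is a sketch based only on the values quoted above; the method-name keys and the fallback behavior for unlisted methods are assumptions.

```python
# Per-method weight lambda for the DOR regularization term,
# as reported in the paper's experiment setup.
DOR_LAMBDA = {
    "CoOp": 8.0,   # lambda = 8.0 for CoOp
    "MaPLe": 4.0,  # lambda = 4.0 for MaPLe
}
DEFAULT_LAMBDA = 2.0  # 2.0 for other fine-tuning methods


def dor_lambda(method: str) -> float:
    """Return the DOR regularization weight for a fine-tuning method."""
    return DOR_LAMBDA.get(method, DEFAULT_LAMBDA)
```

Since DOR is described as a regularization added to existing prompt-tuning objectives, λ would plausibly weight the regularization term against the task loss (e.g., total loss = task loss + λ · DOR term), though the exact combination is defined in the paper's Section 4.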