Understanding and Mitigating Miscalibration in Prompt Tuning for Vision-Language Models

Authors: Shuoyuan Wang, Yixuan Li, Hongxin Wei

ICML 2025

Reproducibility assessment (Variable — Result — LLM Response):
Research Type — Experimental. Extensive experiments demonstrate that DOR can notably enhance the calibration performance of current fine-tuning methods. Our code is available at https://github.com/ml-stat-Sustech/Outlier-Calibration. We verify the effectiveness of DOR across 11 image classification datasets and 4 types of ImageNets with covariate shifts. Extensive experiments show that DOR can enhance the calibration of existing prompt-tuning methods (see Table 1).
Researcher Affiliation — Academia. 1Department of Statistics and Data Science, Southern University of Science and Technology, Shenzhen, China. 2Department of Computer Sciences, University of Wisconsin-Madison, WI, USA. Correspondence to: Hongxin Wei <EMAIL>.
Pseudocode — No. The paper describes the method "Dynamic Outlier Regularization (DOR)" in natural language in Section 4 ("Method: Dynamic Outlier Regularization") but does not provide a structured pseudocode or algorithm block.
Open Source Code — Yes. Our code is available at https://github.com/ml-stat-Sustech/Outlier-Calibration.
Open Datasets — Yes. For the base-to-new evaluation, we cover diverse classification tasks including ImageNet (Deng et al., 2009), Caltech101 (Fei-Fei et al., 2004), Oxford Pets (Parkhi et al., 2012), Stanford Cars (Krause et al., 2013), Flowers102 (Nilsback & Zisserman, 2008), Food101 (Bossard et al., 2014), FGVCAircraft (Maji et al., 2013), SUN397 (Xiao et al., 2010), UCF101 (Soomro et al., 2012), DTD (Cimpoi et al., 2014) and EuroSAT (Helber et al., 2019).
Dataset Splits — Yes. A downstream dataset will be equally split into base and new classes. The model is trained only on the base classes in a few-shot setting and evaluated on both base and new classes. We fine-tune the model with 16 samples per class in a few-shot setting (Zhou et al., 2022a).
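The base-to-new split described above can be sketched as follows. This is a minimal illustration of the protocol, not the authors' code; the function names, the deterministic half-and-half split, and the seeding scheme are assumptions.

```python
import random


def split_base_new(class_names):
    """Equally split a dataset's classes into base and new halves,
    as in the base-to-new protocol (Zhou et al., 2022a).
    If the class count is odd, the base half gets the extra class."""
    mid = (len(class_names) + 1) // 2
    return class_names[:mid], class_names[mid:]


def sample_few_shot(class_to_indices, base_classes, shots=16, seed=0):
    """Draw a few-shot training subset (16 samples per class in the paper)
    from the base classes only; new classes are held out for evaluation."""
    rng = random.Random(seed)
    return {
        c: rng.sample(class_to_indices[c], min(shots, len(class_to_indices[c])))
        for c in base_classes
    }
```

Evaluation then runs on both halves: accuracy/calibration on the base classes the model was tuned on, and on the unseen new classes.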
Hardware Specification — No. The paper does not explicitly state the specific hardware (GPU/CPU models, memory, etc.) used for running the experiments. It only mentions the use of CLIP (ViT-B/16) as the pre-trained VLM.
Software Dependencies — No. The paper states that implementations are based on an open-source repository and mentions specific models and tuning methods, but it does not provide version numbers for software dependencies (e.g., Python, PyTorch, CUDA).
Experiment Setup — Yes. We fine-tune the model with 16 samples per class in a few-shot setting (Zhou et al., 2022a). We list the general hyperparameters in Table 9. For the hyperparameter λ in DOR, we set λ = 8.0 for CoOp, λ = 4.0 for MaPLe, and λ = 2.0 for other fine-tuning methods.
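The reported λ settings can be captured in a small lookup table. This is a sketch based only on the values quoted above; the method-name keys and the fallback behavior for unlisted methods are assumptions.

```python
# Per-method weight lambda for the DOR regularization term,
# as reported in the paper's experiment setup.
DOR_LAMBDA = {
    "CoOp": 8.0,   # lambda = 8.0 for CoOp
    "MaPLe": 4.0,  # lambda = 4.0 for MaPLe
}
DEFAULT_LAMBDA = 2.0  # 2.0 for other fine-tuning methods


def dor_lambda(method: str) -> float:
    """Return the DOR regularization weight for a fine-tuning method."""
    return DOR_LAMBDA.get(method, DEFAULT_LAMBDA)
```

Since DOR is described as a regularization added to existing prompt-tuning objectives, λ would plausibly weight the regularization term against the task loss (e.g., total loss = task loss + λ · DOR term), though the exact combination is defined in the paper's Section 4.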