Black-Box Test-Time Prompt Tuning for Vision-Language Models

Authors: Fan'an Meng, Chaoran Cui, Hongjun Dai, Shuai Gong

AAAI 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Extensive experiments across 15 datasets demonstrate the superiority of B2TPT. The results show that B2TPT not only outperforms CLIP's zero-shot inference at test time, but also surpasses other gradient-based TPT methods.
Researcher Affiliation Academia Fan'an Meng1, Chaoran Cui1*, Hongjun Dai2*, Shuai Gong1 — 1Shandong University of Finance and Economics, 2Shandong University. EMAIL, EMAIL, EMAIL, EMAIL
Pseudocode Yes Algorithm 1: The training process of B2TPT.
Open Source Code Yes https://github.com/MFAaaaaaa/B2TPT
Open Datasets Yes We conducted experiments on two benchmarks for test-time prompt tuning: the out-of-distribution (OOD) benchmark and the cross-dataset benchmark. In the OOD setting, we evaluate the model's robustness to natural distributional shifts on the following four ImageNet variants considered as out-of-distribution (OOD) data, based on ImageNet (Deng et al. 2009): ImageNet-A (Hendrycks et al. 2021b), ImageNet-V2 (Recht et al. 2019), ImageNet-R (Hendrycks et al. 2021a), and ImageNet-Sketch (Wang et al. 2019). For the cross-dataset setting, on the other hand, we evaluate the model's performance across 10 diverse image classification datasets: Flowers102 (Nilsback and Zisserman 2008), texture classification with DTD (Cimpoi et al. 2014), fine-grained image recognition with Oxford Pets (Parkhi et al. 2012), Stanford Cars (Krause et al. 2013), action classification with UCF101 (Soomro, Zamir, and Shah 2012), general object classification with Caltech101 (Fei-Fei, Fergus, and Perona 2004), Food101 (Bossard, Guillaumin, and Van Gool 2014), scene recognition with SUN397 (Xiao et al. 2010), Aircraft (Maji et al. 2013), and satellite image classification with EuroSAT (Helber et al. 2019).
Dataset Splits No The paper mentions using several benchmark datasets but does not describe explicit train/validation/test splits.
Hardware Specification Yes All experiments were conducted in PyTorch using NVIDIA GeForce RTX 4090 GPUs.
Software Dependencies No The paper mentions "PyTorch" and "CLIP" but does not provide specific version numbers for these software components.
Experiment Setup Yes We used the CMA-ES algorithm to optimize the text and vision prompts, setting the batch size to 32. For clarity, Table 1 provides the default configuration of hyperparameters used in our experiments. Table 1 (Hyperparameters and default values): Batch-size B = 32; Iteration Size I = 4; Population Size λ = 30; Vision Prompt Length Lv = 5; Text Prompt Length Lt = 8; Intrinsic Dimension d1 + d2 = 200.
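To make the setup concrete, the sketch below shows the black-box optimization loop implied by these hyperparameters: a low-dimensional intrinsic vector (d1 + d2 = 200) is mapped through a fixed random projection into the prompt-embedding space and optimized with an evolutionary strategy. This is a simplified elitist evolution strategy used as a stand-in for CMA-ES (which additionally adapts a covariance matrix and step size), and the objective is a toy quadratic loss, not the real query to the black-box CLIP model; the prompt width of 512 is likewise an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hyperparameters from Table 1; D_PROMPT is an assumed flattened prompt size.
D_INTRINSIC = 200          # intrinsic dimension d1 + d2
POP_SIZE = 30              # population size lambda
ITERATIONS = 4             # iteration size I
D_PROMPT = (8 + 5) * 512   # Lt = 8 text + Lv = 5 vision tokens, assumed width 512

# Fixed random projection from the intrinsic subspace to the prompt space.
A = rng.standard_normal((D_PROMPT, D_INTRINSIC)) / np.sqrt(D_INTRINSIC)
target = rng.standard_normal(D_PROMPT)  # dummy optimum for the toy objective


def objective(z: np.ndarray) -> float:
    """Stand-in loss; the real method queries the black-box model here."""
    prompt = A @ z
    return float(np.mean((prompt - target) ** 2))


# Simplified elitist evolution strategy as a stand-in for CMA-ES.
best_z = np.zeros(D_INTRINSIC)
best_loss = objective(best_z)
sigma = 0.5  # fixed mutation step (CMA-ES would adapt this)
for _ in range(ITERATIONS):
    candidates = best_z + sigma * rng.standard_normal((POP_SIZE, D_INTRINSIC))
    losses = [objective(z) for z in candidates]
    i = int(np.argmin(losses))
    if losses[i] < best_loss:  # elitism: keep the best solution found so far
        best_z, best_loss = candidates[i], losses[i]
```

After the loop, `best_z` is the tuned intrinsic vector and `A @ best_z` the corresponding prompt perturbation; the elitist update guarantees the loss never increases across iterations.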