Black-Box Test-Time Prompt Tuning for Vision-Language Models

Authors: Fan'an Meng, Chaoran Cui, Hongjun Dai, Shuai Gong

AAAI 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Extensive experiments across 15 datasets demonstrate the superiority of B2TPT. The results show that B2TPT not only outperforms CLIP's zero-shot inference at test time, but also surpasses other gradient-based TPT methods.
Researcher Affiliation Academia Fan'an Meng1, Chaoran Cui1*, Hongjun Dai2*, Shuai Gong1 — 1Shandong University of Finance and Economics, 2Shandong University. EMAIL, EMAIL, EMAIL, EMAIL
Pseudocode Yes Algorithm 1: The training process of B2TPT.
Open Source Code Yes https://github.com/MFAaaaaaa/B2TPT
Open Datasets Yes We conducted experiments on two benchmarks for test-time prompt tuning: the out-of-distribution (OOD) benchmark and the cross-dataset benchmark. In the OOD setting, we evaluate the model's robustness to natural distributional shifts on the following four ImageNet variants considered as out-of-distribution (OOD) data, based on ImageNet (Deng et al. 2009): ImageNet-A (Hendrycks et al. 2021b), ImageNet-V2 (Recht et al. 2019), ImageNet-R (Hendrycks et al. 2021a), and ImageNet-Sketch (Wang et al. 2019). For the cross-dataset setting, on the other hand, we evaluate the model's performance across 10 diverse image classification datasets: Flowers102 (Nilsback and Zisserman 2008), texture classification with DTD (Cimpoi et al. 2014), fine-grained image recognition with Oxford Pets (Parkhi et al. 2012), Stanford Cars (Krause et al. 2013), action classification with UCF101 (Soomro, Zamir, and Shah 2012), general object classification with Caltech101 (Fei-Fei, Fergus, and Perona 2004), Food101 (Bossard, Guillaumin, and Van Gool 2014), scene recognition with SUN397 (Xiao et al. 2010), Aircraft (Maji et al. 2013), and satellite image classification with EuroSAT (Helber et al. 2019).
Dataset Splits No The paper mentions using several benchmark datasets but does not describe explicit train/validation/test splits.
Hardware Specification Yes All experiments were conducted in PyTorch using NVIDIA GeForce RTX 4090 GPUs.
Software Dependencies No The paper mentions "PyTorch" and "CLIP" but does not provide specific version numbers for these software components.
Experiment Setup Yes We used the CMA-ES algorithm to optimize the text and vision prompts, setting the batch size to 32. For clarity, Table 1 provides the default configuration of hyperparameters used in our experiments. Table 1 (Hyperparameters and default values): Batch-size B = 32; Iteration Size I = 4; Population Size λ = 30; Vision Prompt Length Lv = 5; Text Prompt Length Lt = 8; Intrinsic Dimension d1 + d2 = 200.
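To make the setup concrete, the sketch below shows the black-box optimization loop implied by these hyperparameters: a low-dimensional intrinsic vector (d1 + d2 = 200) is mapped through a fixed random projection into the prompt-embedding space and optimized with an evolutionary strategy. This is a simplified elitist evolution strategy used as a stand-in for CMA-ES (which additionally adapts a covariance matrix and step size), and the objective is a toy quadratic loss, not the real query to the black-box CLIP model; the prompt width of 512 is likewise an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hyperparameters from Table 1; D_PROMPT is an assumed flattened prompt size.
D_INTRINSIC = 200          # intrinsic dimension d1 + d2
POP_SIZE = 30              # population size lambda
ITERATIONS = 4             # iteration size I
D_PROMPT = (8 + 5) * 512   # Lt = 8 text + Lv = 5 vision tokens, assumed width 512

# Fixed random projection from the intrinsic subspace to the prompt space.
A = rng.standard_normal((D_PROMPT, D_INTRINSIC)) / np.sqrt(D_INTRINSIC)
target = rng.standard_normal(D_PROMPT)  # dummy optimum for the toy objective


def objective(z: np.ndarray) -> float:
    """Stand-in loss; the real method queries the black-box model here."""
    prompt = A @ z
    return float(np.mean((prompt - target) ** 2))


# Simplified elitist evolution strategy as a stand-in for CMA-ES.
best_z = np.zeros(D_INTRINSIC)
best_loss = objective(best_z)
sigma = 0.5  # fixed mutation step (CMA-ES would adapt this)
for _ in range(ITERATIONS):
    candidates = best_z + sigma * rng.standard_normal((POP_SIZE, D_INTRINSIC))
    losses = [objective(z) for z in candidates]
    i = int(np.argmin(losses))
    if losses[i] < best_loss:  # elitism: keep the best solution found so far
        best_z, best_loss = candidates[i], losses[i]
```

After the loop, `best_z` is the tuned intrinsic vector and `A @ best_z` the corresponding prompt perturbation; the elitist update guarantees the loss never increases across iterations.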