ZIP: An Efficient Zeroth-order Prompt Tuning for Black-box Vision-Language Models

Authors: Seonghwan Park, Jaehyeon Jeong, Yongjun Kim, Jaeho Lee, Namhoon Lee

ICLR 2025

Reproducibility checklist (Variable | Result | LLM Response):
Research Type: Experimental. We evaluate ZIP on 13+ vision-language tasks in standard benchmarks and show that it achieves an average improvement of approximately 6% in few-shot accuracy and 48% in query efficiency compared to the best-performing alternative BBPT methods, establishing a new state of the art. Our ablation analysis further shows that the proposed clipping mechanism is robust and nearly optimal without the need to manually select the clipping threshold, matching the result of an expensive hyperparameter search.
Researcher Affiliation: Academia. Seonghwan Park^1, Jaehyeon Jeong^1, Yongjun Kim^1, Jaeho Lee^1,2, Namhoon Lee^1,2; ^1 POSTECH, ^2 Yonsei University. EMAIL
Pseudocode: Yes. The summarized training algorithm can be found in Algorithm 1, which outlines each stage of the procedure for easy reference.
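Algorithm 1 itself is not reproduced in this report, but the class of method the checklist refers to, zeroth-order (query-only) optimization of a low-dimensional prompt parameterization with gradient clipping, can be sketched as follows. Everything below is an illustrative stand-in under stated assumptions: the toy loss, learning rate, and fixed clipping threshold are placeholders, not the paper's actual Algorithm 1 or ZIP's threshold-free clipping rule.

```python
import numpy as np

def zo_prompt_tune(loss_fn, dim, query_budget=5000, lr=0.05, mu=1e-2,
                   clip=1.0, seed=0):
    """Two-point zeroth-order optimization with gradient-norm clipping.

    loss_fn is treated as a black box: only function evaluations
    (queries) are used, never explicit gradients. Each step spends two
    queries, so the loop runs query_budget // 2 times. The fixed `clip`
    threshold is a placeholder for illustration only.
    """
    rng = np.random.default_rng(seed)
    x = np.zeros(dim)
    for _ in range(query_budget // 2):
        u = rng.standard_normal(dim)
        # Finite-difference estimate of the directional derivative along u.
        d_loss = (loss_fn(x + mu * u) - loss_fn(x - mu * u)) / (2.0 * mu)
        g = d_loss * u  # stochastic gradient estimate
        norm = np.linalg.norm(g)
        if norm > clip:  # norm-clip the noisy zeroth-order estimate
            g *= clip / norm
        x -= lr * g
    return x

# Usage on a toy quadratic standing in for the black-box few-shot loss:
target = np.full(5, 1.0)
sol = zo_prompt_tune(lambda v: float(np.sum((v - target) ** 2)), dim=5,
                     query_budget=2000)
```

The two function evaluations per step are what make the method "black-box": only the model's loss under a candidate prompt is ever observed, which is why the query budget (5,000 in the experiments) is the binding resource.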
Open Source Code: Yes. The implementation is available at https://github.com/LOG-postech/ZIP.
Open Datasets: Yes. To assess the query efficiency and performance of ZIP, we conduct evaluations on standard generalization tasks following the protocols of Zhou et al. (2022a;b) and Oh et al. (2023). These tasks include few-shot learning, base-to-new generalization, cross-dataset transfer, and out-of-distribution (OOD) generalization. For few-shot learning, base-to-new generalization, and cross-dataset transfer, we evaluate ZIP across 13 diverse image classification tasks: ImageNet (Deng et al., 2009), Caltech101 (Fei-Fei et al., 2004), Oxford Pets (Parkhi et al., 2012), Flowers102 (Nilsback & Zisserman, 2008), Food101 (Bossard et al., 2014), FGVCAircraft (Maji et al., 2013), SUN397 (Xiao et al., 2010), Resisc45 (Cheng et al., 2017), DTD (Cimpoi et al., 2014), SVHN (Netzer et al., 2011), EuroSAT (Helber et al., 2019), CLEVR (Johnson et al., 2017), and UCF101 (Soomro et al., 2012). For evaluating OOD generalization, we employ four established OOD datasets to measure the robustness of ZIP under distribution shifts: ImageNetV2 (Recht et al., 2019), ImageNet-Sketch (Wang et al., 2019), ImageNet-A (Hendrycks et al., 2021b), and ImageNet-R (Hendrycks et al., 2021a).
Dataset Splits: Yes. For all baselines, we follow the standardized few-shot evaluation protocol across datasets, consistent with Zhou et al. (2022b) and Oh et al. (2023), which includes specific few-shot splits to ensure a fair comparison. [...] All the results are based on 16 shots per class. [...] Table 15 (partial quote; the columns appear to be dataset, train size, val size, test size, task category, and prompt template): ImageNet, 1.28M, N/A, 50,000, generic object, "a photo of a [CLASS]."; Caltech101, 4,128, 1,649, 2,465, generic object, "a photo of a [CLASS]."
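Operationally, a 16-shot protocol amounts to subsampling a fixed number of training examples per class with a fixed random seed, so that every method sees the same few-shot training set. A minimal sketch; the function name and the (example, label) data layout are assumptions for illustration, not the benchmark suite's actual split code:

```python
import random
from collections import defaultdict

def few_shot_split(labeled_examples, shots=16, seed=0):
    """Draw at most `shots` training examples per class, reproducibly.

    labeled_examples: iterable of (example, label) pairs. A fixed seed
    makes the subset identical across runs and across methods, which is
    what a standardized few-shot split requires.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for example, label in labeled_examples:
        by_class[label].append(example)
    subset = []
    for label in sorted(by_class):  # sort classes for deterministic order
        pool = by_class[label]
        chosen = rng.sample(pool, min(shots, len(pool)))
        subset.extend((ex, label) for ex in chosen)
    return subset
```

Averaging over three seeds, as the setup below describes, would then mean repeating training over three such subsets (or three optimizer seeds) and reporting the mean accuracy.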
Hardware Specification: Yes. We conduct our experiments on NVIDIA 3090, A6000, and A100 GPUs and Intel Gaudi-v2 accelerators.
Software Dependencies: No. No specific versions for key software components (such as Python, PyTorch, or other libraries) are provided.
Experiment Setup: Yes. We consistently set the number of context tokens m to 8 for ZIP and use 5,000 queries across all tasks for all BBPT baselines. The intrinsic dimensionality d is set to 500, and the rank of the low-rank matrices to r = 5, resulting in a total of 417 learnable parameters δ via the formula r(d/m + m + 1) + d/m. Following previous work on transfer learning (Zhou et al., 2022a;b; Oh et al., 2023), we initialize soft prompts from prompts derived from source tasks. We use the official code to reproduce BBPT baselines, and the results are averaged over three different random seeds. [...] Table 16 (partial quote; hyperparameter search space and the methods it applies to):
- initial LR {40.0, 20.0, 10.0, 5.0, 1.0}: BAR
- initial LR (a1) {1.0, 0.1, 0.01, 0.005}: BLACKVIP, ZIP
- min LR {0.1, 0.01, 0.001}: BAR
- decaying step {0.9, 0.5, 0.1}: BAR
- LR decaying factor {0.6, 0.5, 0.4, 0.3}: BLACKVIP, ZIP
- initial PM (c1) {0.01, 0.005, 0.001}: BLACKVIP, ZIP
- PM decaying factor {0.2, 0.1}: BLACKVIP, ZIP
- std. of perturbation {1.0, 0.5}: BAR
- smoothing {0.1, 0.01, 0.001}: BAR
- gradient smoothing {0.9, 0.7, 0.5, 0.3}: BLACKVIP
- population size {5, 10, 15, 20}: BPTVLM
- intrinsic dimensionality {500, 1000, 2000}: BPTVLM, ZIP
- rank {1, 3, 5}: ZIP
- visual tokens {5, 10}: BPTVLM
- text tokens {5, 10}: BPTVLM
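The quoted parameter count can be sanity-checked. Reading d/m as integer (floor) division, an assumption made here because 500/8 is not a whole number, the formula r(d/m + m + 1) + d/m with r = 5, d = 500, m = 8 gives 5·(62 + 8 + 1) + 62 = 417, matching the reported total:

```python
def zip_param_count(d: int, m: int, r: int) -> int:
    """Evaluate the quoted formula r(d/m + m + 1) + d/m.

    d/m is taken as floor division (an assumption made here so that
    the count comes out as a whole number; 500 / 8 is not an integer).
    """
    k = d // m  # 500 // 8 = 62
    return r * (k + m + 1) + k

print(zip_param_count(d=500, m=8, r=5))  # prints 417
```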