Understanding Model Reprogramming for CLIP via Decoupling Visual Prompts
Authors: Chengyi Cai, Zesheng Ye, Lei Feng, Jianzhong Qi, Feng Liu
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimentally, DVP outperforms baselines on average across 11 downstream datasets. Notably, the DVP-PRM integration enables insights into how individual visual prompts influence classification decisions, providing a probabilistic framework for understanding reprogramming. Our code is available at https://github.com/tmlr-group/DecoupledVP Section 5 shows the application of DVP to 11 commonly used downstream datasets and four CLIP backbones, demonstrating its effectiveness. The parameter analysis, ablation experiments, and independence tests further validate the rationality of DVP. In conclusion, both theoretical analysis and experimental results verify the soundness of DVP. |
| Researcher Affiliation | Collaboration | 1School of Computing and Information Systems, The University of Melbourne 2School of Computer Science and Engineering, Southeast University 3Idealism Technology (Beijing). Correspondence to: Feng Liu <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 Pipeline of DVP |
| Open Source Code | Yes | Our code is available at https://github.com/tmlr-group/DecoupledVP |
| Open Datasets | Yes | All datasets are publicly available and listed as follows: FGVCAircraft (Aircraft) (Maji et al., 2013), Caltech101 (Caltech) (Fei-Fei et al., 2004), Stanford Cars (Cars) (Krause et al., 2013), Texture (DTD) (Cimpoi et al., 2014), EuroSAT (ESAT) (Helber et al., 2019), Flowers102 (Flowers) (Nilsback & Zisserman, 2008), Food101 (Food) (Bossard et al., 2014), Oxford Pets (Pets) (Parkhi et al., 2012), SUN397 (SUN) (Xiao et al., 2010), UCF101 (UCF) (Soomro et al., 2012), Resisc45 (Resisc) (Cheng et al., 2017). |
| Dataset Splits | Yes | We follow the prior work (Cai et al., 2025) to set up our benchmark, employing the same methodology to split the 16-shot training, validation, and test sets. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or cloud computing instance specifications used for the experiments. It only mentions general support from 'The University of Melbourne's Research Computing Services and the Petascale Campus Initiative'. |
| Software Dependencies | No | The paper mentions using an 'SGD optimizer' and a 'cosine annealing scheduler' but does not specify the version numbers for any software libraries (e.g., Python, PyTorch, TensorFlow, scikit-learn) or specific tools used beyond general algorithmic references like K-means. |
| Experiment Setup | Yes | All VR baseline methods are trained with consistent settings: a learning rate of 40, a momentum of 0.9 (SGD optimizer (Harold et al., 1997)), and a cosine annealing scheduler (Loshchilov & Hutter, 2016), over 200 epochs. Results are averaged across three random seeds. For method-specific hyper-parameters, we followed (Cai et al., 2025) by using a VR noise pattern with a frame size of 30 for VP (Bahng et al., 2022) and a frame size of 16 for AR (Chen et al., 2023; Tsai et al., 2020) and AttrVR (Cai et al., 2025). To ensure fairness, our DVP utilized the same settings as AttrVR. For DVP-cls, we use the same descriptions as (Cai et al., 2025). For K-means, we use a maximum of 300 iterations and set the relative tolerance on the Frobenius norm of the change in cluster centers to 1e-4. For DVP-cse, we use GPT-4o-mini (Brown et al., 2020) to generate descriptions, with the maximum token count set to 50, generation stopped at '.', and the temperature set to 0.99. We set m = 20 for each cause number in DVP-cse and the attribute numbers in DVP-cls. |
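The reported training schedule (learning rate 40, cosine annealing over 200 epochs) follows the standard formula of Loshchilov & Hutter (2016). A minimal sketch of that schedule, assuming a minimum learning rate of 0 and no warm restarts (the paper does not state either detail):

```python
import math

def cosine_annealing_lr(epoch, total_epochs=200, lr_max=40.0, lr_min=0.0):
    """Cosine annealing (Loshchilov & Hutter, 2016): decay the learning
    rate from lr_max at epoch 0 to lr_min at epoch total_epochs.
    lr_min=0.0 is an assumption; the paper only reports lr_max and epochs."""
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * epoch / total_epochs))

# The schedule starts at 40, reaches 20 at the midpoint, and ends near 0.
print(cosine_annealing_lr(0), cosine_annealing_lr(100), cosine_annealing_lr(200))
```

In frameworks such as PyTorch this corresponds to pairing an SGD optimizer (momentum 0.9) with a `CosineAnnealingLR` scheduler whose `T_max` equals the epoch count.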
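The K-means settings quoted above (max 300 iterations, tolerance 1e-4 on the Frobenius norm of the change in cluster centers) match scikit-learn's defaults, where `tol` is additionally scaled by the mean variance of the data. A hedged sketch of just the stopping test, using the unscaled norm for simplicity (the scaling detail is an assumption about the authors' implementation):

```python
import math

def centers_converged(old_centers, new_centers, tol=1e-4):
    """Return True when the Frobenius norm of the difference between
    successive cluster-center matrices falls below tol. scikit-learn
    scales tol by the data's mean variance; this sketch compares the
    raw norm directly."""
    frob = math.sqrt(sum((a - b) ** 2
                         for oc, nc in zip(old_centers, new_centers)
                         for a, b in zip(oc, nc)))
    return frob < tol
```

With `max_iter=300`, iteration would stop at whichever comes first: this convergence test passing or the iteration cap being reached.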