Differentiable Prompt Learning for Vision Language Models
Authors: Zhenhan Huang, Tejaswini Pedapati, Pin-Yu Chen, Jianxi Gao
IJCAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We test the DPL method on the pre-trained CLIP. We empirically find that by using only limited data, our DPL method can find a deep continuous prompt configuration with high confidence. The performance on the downstream tasks exhibits the superiority of the automatic design: our method boosts the average test accuracy by 2.60% on 11 datasets compared to baseline methods. The few-shot learning experiments show that the DPL method can find continuous prompt configurations, i.e., the context length and depth of continuous prompts inserted into the input of each layer. The performance of downstream fine-tuning over 11 datasets shows the superiority of the proposed method. |
| Researcher Affiliation | Collaboration | Zhenhan Huang (1), Tejaswini Pedapati (2), Pin-Yu Chen (2) and Jianxi Gao (1); (1) Rensselaer Polytechnic Institute, (2) IBM Research. EMAIL, EMAIL, EMAIL, EMAIL |
| Pseudocode | Yes | Algorithm 1: Searching stage for vision-language models. 1: Input: A pre-trained model and two α matrices A_α ∈ ℝ^{ℓ×t} with randomly initialized weights. 2: while not converged do 3: Update A_α by descending ∇_{A_α} L_val(E, A_α). 4: Update continuous prompts in both the text branch and the image branch by descending ∇_E L_train(E, A_α). 5: end while 6: for i = 1 to ℓ do 7: A_α[i,k] = max_m A_α[i,m]; k determines the context length of continuous prompts for the i-th block in the best prompt configuration. 8: end for 9: Output: Prompt configuration for the image branch and the text branch. |
| Open Source Code | Yes | We release our code in https://github.com/Zhenhan-Huang/Differentiable-Prompt-Learn. |
| Open Datasets | Yes | We evaluate the DPL method on 11 datasets: Caltech101 [Fei-Fei et al., 2004] and ImageNet [Deng et al., 2009] for the generic object classification, Describable Textures [Cimpoi et al., 2014] for the texture classification, EuroSAT [Helber et al., 2019] for the satellite image classification, FGVCAircraft [Maji et al., 2013], Food101 [Bossard et al., 2014], Oxford Flowers [Nilsback and Zisserman, 2008], Oxford Pets [Parkhi et al., 2012], and Stanford Cars [Krause et al., 2013] for the fine-grained image recognition, UCF101 [Soomro et al., 2012] for the action classification, and SUN397 [Xiao et al., 2010] for the scene recognition. |
| Dataset Splits | No | We use the few-shot learning setting in the searching stage. The number of shots is the same for the searching and training stages. The number of shots is 16. The results of using 8/4/2/1 shots are shown in Appendix A.5. |
| Hardware Specification | Yes | Experiments are conducted using a single NVIDIA A40 GPU. |
| Software Dependencies | No | The paper mentions software components such as the pre-trained CLIP model and PyTorch, but does not provide specific version numbers for any of them. |
| Experiment Setup | Yes | In both the training stage and the searching stage, we use the same hyperparameters except for the number of epochs. The number of epochs in the searching stage is 60 while that for the training stage is 40. The batch size is 4 and we use stochastic gradient descent (SGD) to optimize continuous prompts. In the searching stage, two α matrices are optimized using the SGD strategy. The learning rate is 3.5 × 10⁻³. |
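The searching stage quoted above alternates two gradient steps (on the α matrix and on the continuous prompts) and then reads the prompt configuration off the α matrix row-wise via argmax. The toy sketch below illustrates only that control flow, not the paper's actual implementation: the shapes, the quadratic placeholder losses, and the names `E` and `A_alpha` are illustrative assumptions, while the learning rate (3.5 × 10⁻³) and epoch count (60) come from the Experiment Setup row.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the paper's quantities (illustrative only):
#   E       -- continuous prompts for text + image branches, here one flat vector
#   A_alpha -- architecture matrix in R^{l x t}: row i holds scores over t
#              candidate context lengths for the i-th transformer block
num_blocks, num_candidates = 4, 5
E = rng.normal(size=8)
A_alpha = rng.normal(size=(num_blocks, num_candidates))

lr = 3.5e-3          # learning rate reported in the paper
for epoch in range(60):  # searching stage runs 60 epochs in the paper
    # Step 3: update A_alpha by descending grad of a placeholder "validation"
    # loss sum(A_alpha**2); its gradient is 2 * A_alpha.
    A_alpha -= lr * 2.0 * A_alpha
    # Step 4: update prompts E by descending grad of a placeholder "training"
    # loss sum((E - 1)**2); its gradient is 2 * (E - 1).
    E -= lr * 2.0 * (E - 1.0)

# Steps 6-8: derive the configuration; for each block, the candidate with the
# largest alpha score determines that block's context length.
config = A_alpha.argmax(axis=1)
print(config)  # one context-length index per transformer block
```

In the paper this is a bi-level (DARTS-style) optimization where the two losses are computed on separate validation and training splits; the placeholders here only stand in for those differentiable objectives.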