Learning to Prompt with Text Only Supervision for Vision-Language Models
Authors: Muhammad Uzair Khattak, Muhammad Ferjad Naeem, Muzammal Naseer, Luc Van Gool, Federico Tombari
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We perform evaluations on 4 benchmarks, where ProText improves over ensembling methods while being competitive with those using labeled images. |
| Researcher Affiliation | Collaboration | 1MBZ University of AI, 2Computer Science Department and Center for Secure Cyber-Physical Systems, Khalifa University, 3INSAIT, 4TU Munich, 5Google |
| Pseudocode | No | The paper describes the method and framework using textual descriptions and a visual diagram (Figure 2), but does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code: https://github.com/muzairkhattak/ProText |
| Open Datasets | Yes | For cross-dataset transfer, domain generalization, and base-to-novel generalization, we use 11 datasets that cover multiple recognition tasks. These include ImageNet (Deng et al. 2009) and Caltech101 (Fei-Fei, Fergus, and Perona 2004), which contain generic objects; OxfordPets (Parkhi et al. 2012), StanfordCars (Krause et al. 2013), Flowers102 (Nilsback and Zisserman 2008), Food101 (Bossard, Guillaumin, and Gool 2014), and FGVCAircraft (Maji et al. 2013) for fine-grained classification; SUN397 (Xiao et al. 2010) for scene recognition; UCF101 (Soomro, Zamir, and Shah 2012) for action recognition; DTD (Cimpoi et al. 2014) for texture classification; and EuroSAT (Helber et al. 2019) for satellite images. For domain generalization, we train models on ImageNet (Deng et al. 2009) as a source dataset and use ImageNet-A (Hendrycks et al. 2021b), ImageNet-R (Hendrycks et al. 2021a), ImageNet-S (Wang et al. 2019), and ImageNet-V2 (Recht et al. 2019) for OOD evaluation. |
| Dataset Splits | Yes | Following previous methods (Zhou et al. 2022a), we split each dataset into base and novel classes. Models are trained on base classes and evaluated on the test set of base and novel classes... Optimal training configuration is obtained through hyper-parameter search on validation split of datasets. |
| Hardware Specification | Yes | AdamW optimizer is used with 5 warm-up epochs for training using a 16-GB V100 GPU. |
| Software Dependencies | No | The paper mentions using a "pretrained ViT-B/16 CLIP model from OpenAI" and various LLMs (GPT-3, Mixtral-8x7B) for data generation, but does not specify the version numbers of any core programming languages or libraries (e.g., Python, PyTorch) used for implementation. |
| Experiment Setup | Yes | ProText uses deep language prompting in the first 9 transformer blocks of the CLIP text encoder. For the cross-dataset transfer and domain generalization settings, we train ProText using T = 4 and T = 16 language prompts with 10 and 200 epochs respectively... AdamW optimizer is used with 5 warm-up epochs for training... Setting prompt length to 16 leads to optimal performance. Fig. 3 (b) shows the effect of prompt depth on final performance, where a prompt depth of 9 gives optimal results. |
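The Experiment Setup row can be collected into a small configuration sketch. This is a minimal illustration, not the authors' code: the settings quoted from the paper are prompt depth 9, T = 4 prompts / 10 epochs for cross-dataset transfer, T = 16 prompts / 200 epochs for domain generalization, and AdamW with 5 warm-up epochs; the base learning rate and the linear warm-up shape are assumptions for illustration only.

```python
# Hypothetical sketch of the ProText training settings reported in the paper.
# Prompt length, depth, epochs, and the 5 warm-up epochs are quoted values;
# base_lr and the linear warm-up ramp are illustrative assumptions.

CONFIGS = {
    "cross_dataset_transfer": {"prompt_len": 4, "prompt_depth": 9, "epochs": 10},
    "domain_generalization": {"prompt_len": 16, "prompt_depth": 9, "epochs": 200},
}

def warmup_lr(epoch, base_lr=0.0025, warmup_epochs=5):
    """Linearly ramp the learning rate over the first `warmup_epochs` epochs
    (the paper trains AdamW with 5 warm-up epochs), then hold it constant."""
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs
    return base_lr

# Example: per-epoch learning rates for the cross-dataset transfer setting.
cfg = CONFIGS["cross_dataset_transfer"]
schedule = [warmup_lr(e) for e in range(cfg["epochs"])]
print(schedule[:6])  # ramps up over epochs 0-4, flat from epoch 5 onward
```

Restated this way, the table makes clear that only the prompt length and epoch budget change between the two evaluation settings; the prompt depth and warm-up scheme are shared.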