Learning to Prompt with Text Only Supervision for Vision-Language Models
Authors: Muhammad Uzair Khattak, Muhammad Ferjad Naeem, Muzammal Naseer, Luc Van Gool, Federico Tombari
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We perform evaluations on 4 benchmarks, where ProText improves over ensembling methods while being competitive with those using labeled images. |
| Researcher Affiliation | Collaboration | 1MBZ University of AI, 2Computer Science Department and Center for Secure Cyber-Physical Systems, Khalifa University, 3INSAIT, 4TU Munich, 5Google |
| Pseudocode | No | The paper describes the method and framework using textual descriptions and a visual diagram (Figure 2), but does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code: https://github.com/muzairkhattak/ProText |
| Open Datasets | Yes | For cross-dataset transfer, domain generalization, and base-to-novel generalization, we use 11 datasets that cover multiple recognition tasks. These include ImageNet (Deng et al. 2009) and Caltech101 (Fei-Fei, Fergus, and Perona 2004), which contain generic objects; OxfordPets (Parkhi et al. 2012), StanfordCars (Krause et al. 2013), Flowers102 (Nilsback and Zisserman 2008), Food101 (Bossard, Guillaumin, and Gool 2014), and FGVCAircraft (Maji et al. 2013) for fine-grained classification; SUN397 (Xiao et al. 2010) for scene recognition; UCF101 (Soomro, Zamir, and Shah 2012) for action recognition; DTD (Cimpoi et al. 2014) for texture classification; and EuroSAT (Helber et al. 2019) for satellite images. For domain generalization, we train models on ImageNet (Deng et al. 2009) as a source dataset and use ImageNet-A (Hendrycks et al. 2021b), ImageNet-R (Hendrycks et al. 2021a), ImageNet-S (Wang et al. 2019), and ImageNet-V2 (Recht et al. 2019) for OOD evaluation. |
| Dataset Splits | Yes | Following previous methods (Zhou et al. 2022a), we split each dataset into base and novel classes. Models are trained on base classes and evaluated on the test set of base and novel classes... Optimal training configuration is obtained through hyper-parameter search on validation split of datasets. |
| Hardware Specification | Yes | AdamW optimizer is used with 5 warm-up epochs for training using a 16-GB V100 GPU. |
| Software Dependencies | No | The paper mentions using a "pretrained ViT-B/16 CLIP model from OpenAI" and various LLMs (GPT-3, Mixtral-8x7B) for data generation, but does not specify the version numbers of any core programming languages or libraries (e.g., Python, PyTorch) used for implementation. |
| Experiment Setup | Yes | ProText uses deep language prompting in the first 9 transformer blocks of the CLIP text encoder. For the cross-dataset transfer and domain generalization settings, we train ProText using T = 4 and T = 16 language prompts with 10 and 200 epochs respectively... AdamW optimizer is used with 5 warm-up epochs for training... Setting prompt length to 16 leads to optimal performance. Fig. 3 (b) shows the effect of prompt depth on final performance, where a prompt depth of 9 gives optimal results. |
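The Experiment Setup row can be collected into a small configuration sketch. This is a minimal illustration, not the authors' code: the settings quoted from the paper are prompt depth 9, T = 4 prompts / 10 epochs for cross-dataset transfer, T = 16 prompts / 200 epochs for domain generalization, and AdamW with 5 warm-up epochs; the base learning rate and the linear warm-up shape are assumptions for illustration only.

```python
# Hypothetical sketch of the ProText training settings reported in the paper.
# Prompt length, depth, epochs, and the 5 warm-up epochs are quoted values;
# base_lr and the linear warm-up ramp are illustrative assumptions.

CONFIGS = {
    "cross_dataset_transfer": {"prompt_len": 4, "prompt_depth": 9, "epochs": 10},
    "domain_generalization": {"prompt_len": 16, "prompt_depth": 9, "epochs": 200},
}

def warmup_lr(epoch, base_lr=0.0025, warmup_epochs=5):
    """Linearly ramp the learning rate over the first `warmup_epochs` epochs
    (the paper trains AdamW with 5 warm-up epochs), then hold it constant."""
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs
    return base_lr

# Example: per-epoch learning rates for the cross-dataset transfer setting.
cfg = CONFIGS["cross_dataset_transfer"]
schedule = [warmup_lr(e) for e in range(cfg["epochs"])]
print(schedule[:6])  # ramps up over epochs 0-4, flat from epoch 5 onward
```

Restated this way, the table makes clear that only the prompt length and epoch budget change between the two evaluation settings; the prompt depth and warm-up scheme are shared.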