Tree of Attributes Prompt Learning for Vision-Language Models
Authors: Tong Ding, Wanhua Li, Zhongqi Miao, Hanspeter Pfister
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experimental results demonstrate that our approach outperforms state-of-the-art methods on the zero-shot base-to-novel generalization, cross-dataset transfer, as well as few-shot classification across 11 diverse datasets. |
| Researcher Affiliation | Collaboration | Tong Ding (1,2), Wanhua Li (1), Zhongqi Miao (3), Hanspeter Pfister (1) — 1: Harvard University, 2: Mass General Brigham, 3: Microsoft |
| Pseudocode | No | The paper describes the methodology in prose and mathematical equations but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code is available at https://github.com/HHenryD/TAP. Our code is also released under the MIT license. |
| Open Datasets | Yes | For all of the three settings, we follow previous works (Zhou et al., 2022b;a), using 11 image recognition datasets, including: ImageNet (Deng et al., 2009) and Caltech101 (Fei-Fei et al., 2004) for generic object recognition; OxfordPets (Parkhi et al., 2012), StanfordCars (Krause et al., 2013), Flowers102 (Nilsback & Zisserman, 2008), Food101 (Bossard et al., 2014), and FGVCAircraft (Maji et al., 2013) for fine-grained classification; SUN397 (Xiao et al., 2010) for scene recognition; UCF101 (Soomro et al., 2012) for action recognition; DTD (Cimpoi et al., 2014) for texture classification; and EuroSAT (Helber et al., 2019) for satellite image analysis. |
| Dataset Splits | Yes | In base-to-novel generalization, we equally split the classes into base and novel classes. Initial training and evaluations are conducted on the seen base classes, followed by evaluation on the unseen novel classes in a zero-shot manner. ... train on ImageNet with 16 shots per class, and directly evaluate on other datasets in a zero-shot manner; and few-shot classification with 16 shots per class. |
| Hardware Specification | Yes | We use PyTorch (Paszke et al., 2017) to implement all experiments on a single NVIDIA A100-80GB GPU. |
| Software Dependencies | No | The paper mentions "We use PyTorch (Paszke et al., 2017)" but does not specify a version number for PyTorch or any other software libraries or programming languages used. |
| Experiment Setup | Yes | During training, we iteratively train the vision and text encoders on a schedule of 5 epochs for the vision encoder and 1 epoch for the text encoder. We train a total of 60, 24, and 120 epochs for base-to-novel generalization, cross-dataset transfer, and few-shot classification respectively. We set α = 0.4 for all datasets. We also use a Gaussian Prompt Weighting (GPA) following (Khattak et al., 2023b), with a mean of 0.9N and std of 0.1N, where N represents the total number of epochs, for all tasks. ... For the DTD, Oxford Flowers, Stanford Cars, UCF101, and Caltech101 datasets, which have fewer attributes, we use a low learning rate of 0.002 for the text encoder to avoid overfitting and a high learning rate of 0.006 for the vision encoder to facilitate the learning process. A high µ3 = 3 is also used to regularize the text encoder to prevent overfitting. For the remaining 6 datasets, which have more attributes, the learning rates for both text and vision encoders are set to 0.004, with µ3 = 1.5. µ1 = 10 and µ2 = 2.5 are used for all datasets. |
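The hyperparameters reported in the Experiment Setup row can be collected into a single reproducibility sketch. This is a minimal, hypothetical transcription, not a config file published by the authors; all function and key names (`hparams`, `lr_text`, `gpa_mean`, etc.) are illustrative.

```python
# Hedged sketch: hyperparameters transcribed from the paper's reported setup.
# Key names are illustrative assumptions; the paper specifies no config format.

EPOCHS = {
    "base_to_novel": 60,
    "cross_dataset": 24,
    "few_shot": 120,
}

# Datasets reported as having fewer attributes use asymmetric learning rates
# and a stronger text-encoder regularizer (mu3 = 3); the remaining 6 datasets
# use 0.004 for both encoders with mu3 = 1.5.
LOW_ATTR_DATASETS = {"DTD", "Flowers102", "StanfordCars", "UCF101", "Caltech101"}


def hparams(dataset: str, task: str) -> dict:
    """Return the reported hyperparameters for a dataset/task pair."""
    few_attrs = dataset in LOW_ATTR_DATASETS
    n = EPOCHS[task]  # total epochs N, used by Gaussian Prompt Weighting
    return {
        "alpha": 0.4,                       # same for all datasets
        "lr_text": 0.002 if few_attrs else 0.004,
        "lr_vision": 0.006 if few_attrs else 0.004,
        "mu1": 10.0,                        # shared across all datasets
        "mu2": 2.5,                         # shared across all datasets
        "mu3": 3.0 if few_attrs else 1.5,
        "gpa_mean": 0.9 * n,                # GPA mean = 0.9N
        "gpa_std": 0.1 * n,                 # GPA std  = 0.1N
        "epochs": n,
    }
```

For example, `hparams("DTD", "base_to_novel")` yields the low 0.002 text-encoder rate with µ3 = 3 over 60 epochs, while `hparams("Food101", "few_shot")` yields 0.004/0.004 with µ3 = 1.5 over 120 epochs.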