Tree of Attributes Prompt Learning for Vision-Language Models
Authors: Tong Ding, Wanhua Li, Zhongqi Miao, Hanspeter Pfister
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experimental results demonstrate that our approach outperforms state-of-the-art methods on the zero-shot base-to-novel generalization, cross-dataset transfer, as well as few-shot classification across 11 diverse datasets. |
| Researcher Affiliation | Collaboration | Tong Ding (1,2), Wanhua Li (1), Zhongqi Miao (3), Hanspeter Pfister (1) — 1: Harvard University, 2: Mass General Brigham, 3: Microsoft |
| Pseudocode | No | The paper describes the methodology in prose and mathematical equations but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code is available at https://github.com/HHenryD/TAP. Our code is also released under the MIT license. |
| Open Datasets | Yes | For all of the three settings, we follow previous works (Zhou et al., 2022b;a), using 11 image recognition datasets, including: ImageNet (Deng et al., 2009) and Caltech101 (Fei-Fei et al., 2004) for generic object recognition; OxfordPets (Parkhi et al., 2012), StanfordCars (Krause et al., 2013), Flowers102 (Nilsback & Zisserman, 2008), Food101 (Bossard et al., 2014), and FGVCAircraft (Maji et al., 2013) for fine-grained classification; SUN397 (Xiao et al., 2010) for scene recognition; UCF101 (Soomro et al., 2012) for action recognition; DTD (Cimpoi et al., 2014) for texture classification; and EuroSAT (Helber et al., 2019) for satellite image analysis. |
| Dataset Splits | Yes | In base-to-novel generalization, we equally split the classes into base and novel classes. Initial training and evaluations are conducted on the seen base classes, followed by evaluation on the unseen novel classes in a zero-shot manner. ... train on ImageNet with 16 shots per class, and directly evaluate on other datasets in a zero-shot manner; and few-shot classification with 16 shots per class. |
| Hardware Specification | Yes | We use PyTorch (Paszke et al., 2017) to implement all experiments on a single NVIDIA A100-80GB GPU. |
| Software Dependencies | No | The paper mentions "We use PyTorch (Paszke et al., 2017)" but does not specify a version number for PyTorch or any other software libraries or programming languages used. |
| Experiment Setup | Yes | During training, we iteratively train the vision and text encoders on a schedule of 5 epochs for the vision encoder and 1 epoch for the text encoder. We train a total of 60, 24, and 120 epochs for base-to-novel generalization, cross-dataset transfer, and few-shot classification respectively. We set α = 0.4 for all datasets. We also use a Gaussian Prompt Weighting (GPA) following (Khattak et al., 2023b), with a mean of 0.9N and std of 0.1N, where N represents the total number of epochs, for all tasks. ... For the DTD, Oxford Flowers, Stanford Cars, UCF101, and Caltech101 datasets, which have fewer attributes, we use a low learning rate of 0.002 for the text encoder to avoid overfitting and a high learning rate of 0.006 for the vision encoder to facilitate the learning process. A high µ3 = 3 is also used to regularize the text encoder to prevent overfitting. For the remaining 6 datasets, which have more attributes, the learning rates for both text and vision encoders are set to 0.004, with µ3 = 1.5. µ1 = 10 and µ2 = 2.5 are used for all datasets. |
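The hyperparameters reported in the Experiment Setup row can be collected into a single reproducibility sketch. This is a minimal, hypothetical transcription, not a config file published by the authors; all function and key names (`hparams`, `lr_text`, `gpa_mean`, etc.) are illustrative.

```python
# Hedged sketch: hyperparameters transcribed from the paper's reported setup.
# Key names are illustrative assumptions; the paper specifies no config format.

EPOCHS = {
    "base_to_novel": 60,
    "cross_dataset": 24,
    "few_shot": 120,
}

# Datasets reported as having fewer attributes use asymmetric learning rates
# and a stronger text-encoder regularizer (mu3 = 3); the remaining 6 datasets
# use 0.004 for both encoders with mu3 = 1.5.
LOW_ATTR_DATASETS = {"DTD", "Flowers102", "StanfordCars", "UCF101", "Caltech101"}


def hparams(dataset: str, task: str) -> dict:
    """Return the reported hyperparameters for a dataset/task pair."""
    few_attrs = dataset in LOW_ATTR_DATASETS
    n = EPOCHS[task]  # total epochs N, used by Gaussian Prompt Weighting
    return {
        "alpha": 0.4,                       # same for all datasets
        "lr_text": 0.002 if few_attrs else 0.004,
        "lr_vision": 0.006 if few_attrs else 0.004,
        "mu1": 10.0,                        # shared across all datasets
        "mu2": 2.5,                         # shared across all datasets
        "mu3": 3.0 if few_attrs else 1.5,
        "gpa_mean": 0.9 * n,                # GPA mean = 0.9N
        "gpa_std": 0.1 * n,                 # GPA std  = 0.1N
        "epochs": n,
    }
```

For example, `hparams("DTD", "base_to_novel")` yields the low 0.002 text-encoder rate with µ3 = 3 over 60 epochs, while `hparams("Food101", "few_shot")` yields 0.004/0.004 with µ3 = 1.5 over 120 epochs.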