Attribute-based Visual Reprogramming for Vision-Language Models

Authors: Chengyi Cai, Zesheng Ye, Lei Feng, Jianzhong Qi, Feng Liu

ICLR 2025

Reproducibility (variable: result, with supporting LLM response)
Research Type: Experimental. "Empirically, it achieves superior performance in 12 downstream tasks for both ViT-based and ResNet-based CLIP. The success of AttrVR facilitates more effective integration of VR from unimodal vision models into vision-language models. Our code is available at https://github.com/tmlr-group/AttrVR. Experiments conducted on 12 widely-used benchmarks demonstrate the effectiveness of AttrVR in Section 6. AttrVR consistently outperforms other VR methods when using different encoder backbones or fewer training samples. Visualizations of the embedding space and individual samples with their top-matched attributes also substantiate the efficacy of AttrVR. Additional ablation, hyper-parameter (see Section 6) and aggregation studies (see Appendix C.3) further examine the contributions of different components within AttrVR."
Researcher Affiliation: Collaboration. Chengyi Cai¹, Zesheng Ye¹, Lei Feng²·³, Jianzhong Qi¹, Feng Liu¹; ¹The University of Melbourne, ²Southeast University, ³Idealism Technology (Beijing)
Pseudocode: Yes. "Algorithm 1: Training Pipeline of AttrVR"
Open Source Code: Yes. "Our code is available at https://github.com/tmlr-group/AttrVR."
Open Datasets: Yes. "Experiments conducted on 12 widely-used benchmarks demonstrate the effectiveness of AttrVR in Section 6. ... All image datasets are publicly available. Detailed task information and the batch size used for training VR are provided in Table 4."
Dataset Splits: Yes. "This paper establishes benchmarks for downstream classification tasks following prior work (Oh et al., 2023), employing the same methodology to split the 16-shot training, validation, and test sets."
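The 16-shot protocol above means sampling 16 labeled training examples per class. As a point of reference, a minimal sketch of such a per-class split is shown below; the function and variable names are illustrative assumptions, not the authors' actual split code (which follows Oh et al., 2023).

```python
import random
from collections import defaultdict

def few_shot_split(samples, shots=16, seed=0):
    """Illustrative k-shot split: draw `shots` examples per class for training;
    the remainder forms a held-out pool. `samples` is a list of
    (item, label) pairs. This is a sketch, not the paper's split code."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for item, label in samples:
        by_class[label].append((item, label))
    train, rest = [], []
    for label, items in by_class.items():
        rng.shuffle(items)  # randomize before taking the k-shot subset
        train.extend(items[:shots])
        rest.extend(items[shots:])
    return train, rest

# Example: 3 classes with 20 samples each -> 48 train, 12 held out
data = [(f"img_{c}_{i}.jpg", c) for c in range(3) for i in range(20)]
train, rest = few_shot_split(data, shots=16)
```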
Hardware Specification: Yes. "Experiments are conducted on a single A100 GPU."
Software Dependencies: No. The paper mentions using "GPT-3.5 (Brown, 2020)" for attribute generation and the "SGD optimizer (Harold et al., 1997)" with a "cosine annealing learning rate scheduler (Loshchilov & Hutter, 2016)" for training. While these identify specific algorithms and models, the paper does not list software libraries or frameworks with version numbers (e.g., PyTorch 1.9, TensorFlow 2.x), which are crucial for exact replication.
Experiment Setup: Yes. "Regarding hyper-parameters in AttrVR, we set k = 3 and λ = 0.5 and will discuss their impact. ... For all VR baseline methods compared in the paper, we adopted the following uniform training settings: an initial learning rate of 40, a momentum of 0.9 using the SGD optimizer (Harold et al., 1997), and a cosine annealing learning rate scheduler (Loshchilov & Hutter, 2016). The total number of learning epochs was set to 200."
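For reference, the cosine annealing schedule cited above (Loshchilov & Hutter, 2016) can be written as lr(t) = lr_min + ½(lr_max − lr_min)(1 + cos(πt/T)). The sketch below plugs in the paper's reported values (initial learning rate 40, 200 epochs); taking lr_min = 0 is an assumption, as the paper's excerpt does not state a minimum learning rate.

```python
import math

def cosine_annealed_lr(epoch, total_epochs=200, lr_max=40.0, lr_min=0.0):
    """Cosine annealing (Loshchilov & Hutter, 2016):
    lr(t) = lr_min + 0.5 * (lr_max - lr_min) * (1 + cos(pi * t / T)).
    lr_max=40 and total_epochs=200 follow the paper; lr_min=0 is assumed."""
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * epoch / total_epochs))

print(cosine_annealed_lr(0))    # 40.0 -- full learning rate at the start
print(cosine_annealed_lr(100))  # 20.0 -- halfway point of the cosine curve
print(cosine_annealed_lr(200))  # 0.0  -- fully annealed at the final epoch
```

An initial learning rate of 40 is unusually large for standard network training, but is typical for visual reprogramming, where only the input-space perturbation (not the model weights) is optimized.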