Release the Powers of Prompt Tuning: Cross-Modality Prompt Transfer
Authors: Ningyuan Zhang, Jie Lu, Keqiuyin Li, Zhen Fang, Guangquan Zhang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct extensive experiments involving prompt transfer from 13 source language tasks to 19 target vision tasks under three settings. Our findings demonstrate that: (i) cross-modality prompt transfer is feasible, supported by in-depth analysis; ... (iii) cross-modality prompt transfer can significantly release the powers of prompt tuning on data-scarce tasks, as evidenced by comparisons with a newly released prompt-based benchmark. |
| Researcher Affiliation | Academia | Ningyuan Zhang, Jie Lu, Keqiuyin Li, Zhen Fang, Guangquan Zhang; Australian Artificial Intelligence Institute, University of Technology Sydney. EMAIL EMAIL |
| Pseudocode | Yes | Algorithm 1 Measuring modality gap via MMD |
| Open Source Code | No | The paper refers to source code from previous works that they used or adopted (e.g., "directly adopted from the official code repository of Su et al. (2022)" and "scripts provided by He et al. (2023)"), but does not explicitly state that the authors of *this* paper are releasing their own implementation code for the methodology described in this paper. |
| Open Datasets | Yes | Source NLP Tasks. In total, 13 investigated NLP tasks are selected as the source tasks to train the source prompt. These tasks can be categorized into four groups: (i) Sentiment Analysis (SA) tasks, including IMDB (Maas et al., 2011), SST-2 (Socher et al., 2013), laptop (Pontiki et al., 2016), restaurant (Pontiki et al., 2016), Movie Rationales (Movie, Zaidan et al. (2008)), and Tweet Eval (Tweet, Barbieri et al. (2020)). (ii) Natural Language Inference (NLI) tasks, including MNLI (Williams et al., 2017), QNLI (Wang et al., 2018), and SNLI (Bowman et al., 2015). (iii) Ethical Judgment (EJ) tasks, including deontology and justice (Hendrycks et al., 2020). (iv) Paraphrase Identification (PI) tasks, including QQP (Sharma et al., 2019) and MRPC (Dolan & Brockett, 2005). Target Vision Tasks. The VTAB-1K (Zhai et al., 2019) image classification tasks are chosen as the target tasks. VTAB-1K consists of 19 diverse image classification tasks, each with a training set of 1000 images. These tasks can be divided into three main categories: (i) Natural tasks (including CIFAR100 (Krizhevsky et al., 2009), Caltech101 (Li et al., 2004), DTD (Cimpoi et al., 2014), Flowers102 (Nilsback & Zisserman, 2006), Pets (Parkhi et al., 2012), SVHN (Netzer et al., 2011), and SUN397 (Xiao et al., 2010)) that contain natural images captured using standard cameras; (ii) Specialized tasks (including PatchCamelyon (Veeling et al., 2018), EuroSAT (Helber et al., 2018), Resisc45 (Cheng et al., 2017), and Retinopathy (Dugas et al., 2015)) that contain images captured via specialized equipment; and (iii) Structured tasks (including Clevr (Johnson et al., 2017), DMLab (Zhai et al., 2019), KITTI (Geiger et al., 2012), dSprites (Matthey et al., 2017), and SmallNORB (LeCun et al., 2004)) that require geometric comprehension like object counting. |
| Dataset Splits | Yes | The VTAB-1K (Zhai et al., 2019) image classification tasks are chosen as the target tasks. VTAB-1K consists of 19 diverse image classification tasks, each with a training set of 1000 images. We use the official 800-200 split released by Zhai et al. (2019) to perform the grid-search, training on 800 images and validating using the remaining 200. |
| Hardware Specification | No | The paper mentions using RoBERTa-base and ViT-base models, but does not provide any specific details about the hardware (e.g., GPU/CPU models, memory) used to run the experiments. Phrases like "RoBERTa-base and ViT-base which all have a weight amount of 125M" refer to model sizes, not hardware specifications. |
| Software Dependencies | No | The paper mentions "PyTorch (Paszke et al., 2019)" in Section C.1 but does not specify a version number for it or any other key software components used in the experiments. |
| Experiment Setup | Yes | For vanilla linear probing and frozen prompt transfer, the original VPT repository released a set of linear probing hyperparameters that were carefully grid-searched on each CV task. The same hyperparameters are adopted for vanilla linear probing and frozen prompt transfer... For vanilla VPT, we follow the procedure of Jia et al. (2022), performing grid-search on learning rates {0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10} and weight decay values {0, 0.0001, 0.001, 0.01}, with a batch size of 64, warm-up steps of 10, a cosine learning rate scheduler, and an SGD optimizer with a momentum of 0.9... For projection transfer, a batch size of 64, a learning rate of 0.005, and a weight decay of 0.001 are used. For the optimizer, Adam (Kingma & Ba, 2014) is adopted. For attention transfer, we first perform a grid-search to find the potentially best source prompt and its concentrated length ls... Then, we perform another round of grid-search using the potentially best source prompt and ls to find the potentially best learning rate from {1, 0.5, 0.1, 0.05, 0.01, 0.005, 0.001, 0.0005} and weight decay value from {0, 0.0001, 0.001, 0.01}, with a batch size of 64, warm-up steps of 10, a cosine learning rate scheduler, and an Adam optimizer. The complete set of hyperparameters of different scenarios on each CV task is listed in the Appendix. |
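The pseudocode evidence cites "Algorithm 1: Measuring modality gap via MMD". The paper's exact algorithm is not reproduced here, but a minimal NumPy sketch of a biased MMD² estimate between two embedding sets illustrates the kind of statistic involved; the RBF kernel and the `sigma` bandwidth are assumptions for illustration, not details taken from the paper.

```python
import numpy as np

def mmd_rbf(x, y, sigma=1.0):
    """Biased MMD^2 estimate between samples x (n, d) and y (m, d),
    using an RBF kernel (kernel choice and sigma are illustrative assumptions)."""
    def sq_dists(a, b):
        # pairwise ||a_i - b_j||^2 via the expansion a.a + b.b - 2 a.b
        return ((a ** 2).sum(1)[:, None]
                + (b ** 2).sum(1)[None, :]
                - 2.0 * a @ b.T)

    def k(a, b):
        return np.exp(-sq_dists(a, b) / (2.0 * sigma ** 2))

    # MMD^2 = E[k(x, x')] + E[k(y, y')] - 2 E[k(x, y)]
    return k(x, x).mean() + k(y, y).mean() - 2.0 * k(x, y).mean()
```

By construction the estimate is zero when both samples coincide and grows as the two embedding distributions (here, language-prompt vs. vision-prompt representations) drift apart.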
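The experiment-setup entry describes exhaustive grid-search over learning rates and weight decays. The loop structure can be sketched as follows; the grids are the ones quoted for vanilla VPT, while `train_eval_fn` is a hypothetical callback standing in for "train on the 800-image split, report accuracy on the 200-image validation split".

```python
from itertools import product

# Grids quoted in the paper for vanilla VPT (Jia et al., 2022 procedure)
LEARNING_RATES = [0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]
WEIGHT_DECAYS = [0, 0.0001, 0.001, 0.01]

def grid_search(train_eval_fn):
    """Return the (lr, wd) pair maximizing validation score.

    train_eval_fn(lr, wd) is a hypothetical callback: it should train with
    the given hyperparameters and return a scalar validation accuracy.
    """
    return max(product(LEARNING_RATES, WEIGHT_DECAYS),
               key=lambda pair: train_eval_fn(*pair))
```

Attention transfer reportedly runs the same kind of search twice (first over source prompts and concentrated length, then over learning rate and weight decay), which amounts to calling such a routine with different grids.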