Selective Visual Prompting in Vision Mamba

Authors: Yifeng Yao, Zichen Liu, Zhenyu Cui, Yuxin Peng, Jiahuan Zhou

AAAI 2025

Reproducibility assessment (variable, result, and LLM response):
Research Type: Experimental. "Extensive experimental results on various large-scale benchmarks demonstrate that our proposed SVP significantly outperforms state-of-the-art methods."
Researcher Affiliation: Academia. "Wangxuan Institute of Computer Technology, Peking University, Beijing 100871, China EMAIL, EMAIL"
Pseudocode: No. The paper describes methods using mathematical formulations and diagrams (Figures 1, 2, 3) but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' sections, nor does it present structured, code-like procedural steps.
Open Source Code: Yes. https://github.com/zhoujiahuan1991/AAAI2025-SVP
Open Datasets: Yes. "Following prior works (Huang et al. 2023; Pei et al. 2024), our experiments are carried out on two image classification benchmarks HTA and VTAB. HTA. The head tuning adaptation benchmark (Huang et al. 2023) comprises 10 datasets including CIFAR10 (Krizhevsky, Hinton et al. 2009), CIFAR100 (Krizhevsky, Hinton et al. 2009), DTD (Cimpoi et al. 2014), CUB200 (Wah et al. 2011), NABirds (Van Horn et al. 2015), Stanford-Dogs (Khosla et al. 2011), Oxford-Flowers (Nilsback and Zisserman 2008), Food101 (Bossard, Guillaumin, and Van Gool 2014), GTSRB (Stallkamp et al. 2012) and SVHN (Netzer et al. 2011). VTAB-1K. It collects 19 benchmarks from Visual Task Adaptation (Zhai et al. 2019)... Our experiments primarily involve three pre-trained vision models: ViT-Small/16 and Vim-Small, both of which are pre-trained on ImageNet-1K (Russakovsky et al. 2015), and ViT-Base/16 (Dosovitskiy et al. 2020), which is pre-trained on ImageNet-21K (Krizhevsky, Sutskever, and Hinton 2012)."
Dataset Splits: Yes. "VTAB-1K. It collects 19 benchmarks from Visual Task Adaptation (Zhai et al. 2019), categorized into three groups: i) Natural, ii) Specialized, and iii) Structured, each with 1000 training examples. Following (Zhai et al. 2019; Jia et al. 2022), we use an 800-200 train/val split."
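The 800-200 split of each VTAB-1K task's 1000 training examples can be sketched as follows. This is a minimal illustration, not the released code: the function name, the seed, and shuffling before splitting are all assumptions (the paper does not specify how the split is drawn).

```python
import random

def train_val_split(examples, n_train=800, n_val=200, seed=0):
    """Split a VTAB-1K task's 1000 training examples into an
    800/200 train/val split. Shuffling and the seed are
    illustrative assumptions, not from the paper."""
    if len(examples) < n_train + n_val:
        raise ValueError("not enough examples for the requested split")
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:n_train + n_val]

train, val = train_val_split(range(1000))
print(len(train), len(val))  # 800 200
```

The key property is that train and val are disjoint subsets covering all 1000 examples, matching the 800-200 split reported above.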
Hardware Specification: No. The paper discusses various pre-trained vision models and datasets used in experiments but does not provide specific hardware details such as GPU models, CPU types, or memory configurations used for running the experiments.
Software Dependencies: No. The paper mentions using the AdamW optimizer and cosine annealing, but it does not specify versions for any programming languages, libraries (e.g., PyTorch, TensorFlow), or other software dependencies required to reproduce the experiments.
Experiment Setup: Yes. "Following (Huang et al. 2023), all methods are trained for 100 epochs across all datasets for a fair comparison. For the compared methods, we use the optimizers specified in the original papers to achieve better performance. In our approach, we utilize the AdamW (Loshchilov and Hutter 2017) optimizer for optimization and implement cosine annealing. The number of shared layers in Cross-Prompting is set to 4, 8, or 12, depending on the dataset, and the hidden dimension of the inner-prompts generator is set to 64."
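The cosine-annealing schedule used with AdamW follows the standard SGDR formula from Loshchilov and Hutter. A minimal stdlib-only sketch of that schedule is below; the peak and minimum learning rates are placeholders, since the paper does not report them.

```python
import math

def cosine_annealed_lr(epoch, total_epochs, lr_max, lr_min=0.0):
    """Standard cosine-annealing schedule (SGDR): the learning rate
    decays from lr_max at epoch 0 to lr_min at total_epochs.
    lr_max and lr_min here are illustrative placeholders."""
    cos_term = math.cos(math.pi * epoch / total_epochs)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + cos_term)

# With the paper's 100-epoch budget and an assumed peak lr of 1e-3:
for epoch in (0, 50, 100):
    print(epoch, cosine_annealed_lr(epoch, 100, 1e-3))
# 0   -> 1e-3 (peak)
# 50  -> 5e-4 (halfway)
# 100 -> 0.0  (fully annealed)
```

In a PyTorch setup this schedule would typically be applied via `torch.optim.lr_scheduler.CosineAnnealingLR` on top of an `AdamW` optimizer, but the released code should be consulted for the exact configuration.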