WST: Wavelet-Based Multi-scale Tuning for Visual Transfer Learning
Authors: Jia Zeng, Lan Huang, Kangping Wang
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on transfer learning demonstrate the promising performance and efficiency of our WST. We conduct a series of experiments to evaluate our WST. 1) We evaluate the effectiveness of WST on the VTAB-1K benchmark for basic transfer learning tasks. 2) We verify our WST on few-shot learning. 3) We evaluate the generalization ability of WST on domain generalization. 4) We verify our WST on fine-grained classification. 5) We conduct ablation experiments and visualization to analyze our method. |
| Researcher Affiliation | Academia | 1College of Computer Science and Technology, Jilin University Changchun 130012, China 2Key Laboratory of Symbolic Computation and Knowledge Engineering of the Ministry of Education, Jilin University Changchun 130012, China EMAIL, EMAIL, EMAIL |
| Pseudocode | No | The paper describes the methodology using prose and mathematical equations (Eq 1-9) but does not include a clearly labeled pseudocode or algorithm block. |
| Open Source Code | Yes | Code https://github.com/ZJia-goo/WST |
| Open Datasets | Yes | We choose ViT-B/16 (Dosovitskiy et al. 2020) and Swin-B (Liu et al. 2021) pre-trained on ImageNet-21K (Deng et al. 2009) as backbones. We conduct fine-tuning experiments on ViT-B/16, with results summarized in Table 1. Our WST achieves an average accuracy of 78% while requiring only 0.08M trainable parameters, surpassing previous state-of-the-art (SOTA) methods by 1.3%. Notably, our WST achieves SOTA performance on 11 out of 19 tasks. The VTAB-1K benchmark (Zhai et al. 2019) is a classical benchmark to evaluate transfer ability. It contains 19 vision datasets categorized into 3 groups: 1) Natural group, including generic and fine-grained objects; 2) Specialized group, containing images captured by specialist equipment, such as medical or remote sensing images; 3) Structured group, designed for scene structure comprehension, such as depth prediction, object counting and orientation detection. The benchmarks in the few-shot experiments are five fine-grained datasets, namely: Aircraft (Maji et al. 2013), Pets (Parkhi et al. 2012), Food-101 (Bossard, Guillaumin, and Van Gool 2014), Cars (Krause et al. 2013) and Flowers102 (Nilsback and Zisserman 2008). ImageNet-1K is set as the source domain. The training set is constructed by randomly selecting 16 training samples from each class. We evaluate our WST on both the source domain and four target domains. The four target domains are: 1) ImageNet-Sketch (Wang et al. 2019), consisting of sketch-like images with the same classes as ImageNet; 2) ImageNet-V2 (Recht et al. 2019), collected from a larger source using the same collection process as ImageNet; 3) ImageNet-A (Hendrycks et al. 2021b), containing natural adversarial images; and 4) ImageNet-R (Hendrycks et al. 2021a), composed of a variety of artistic renditions of ImageNet. Following SSF (Lian et al. 2022), we evaluate our WST on fine-grained visual classification (FGVC) benchmarks, including CUB-200-2011 (Wah et al. 2011), NABirds (Van Horn et al. 2015), Oxford Flowers (Nilsback and Zisserman 2008), Stanford Dogs (Khosla et al. 2011) and Stanford Cars (Krause et al. 2013). |
| Dataset Splits | Yes | Each dataset contains 1,000 images for training. We report top-1 accuracy on the test set. The training set contains {1,2,4,8,16}-shot samples per class (Zhang, Zhou, and Liu 2024; Fu, Zhu, and Wu 2024). We report the average top-1 accuracy on the test set over 3 random seeds. ImageNet-1K is set as the source domain. The training set is constructed by randomly selecting 16 training samples from each class. We evaluate our WST on both the source domain and four target domains. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used to run the experiments. |
| Software Dependencies | No | The paper mentions the 'AdamW optimizer (Loshchilov and Hutter 2019)' but does not specify any software libraries or their version numbers (e.g., PyTorch, TensorFlow, Python version) used for implementation. |
| Experiment Setup | Yes | The model is fine-tuned for 100 epochs. The AdamW optimizer (Loshchilov and Hutter 2019) is employed. The learning rate schedule adopts a cosine decay strategy with decay = 0.05 and a 10-epoch linear warm-up. Images are resized to 224×224. We only adopt standard augmentation strategies, without MixUp (Zhang et al. 2018) or CutMix (Yun et al. 2019). The convolution weights of the small-scale patch embedding are initialized with a Kaiming-uniform (He et al. 2015) distribution. The weights and biases of the two linear layers use truncated-normal and zero initialization, respectively. The middle dimension r in the two linear layers is set to 2. The middle dimension r in the two linear mappings is set to 4. |
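The learning-rate schedule described in the Experiment Setup row (cosine decay with a 10-epoch linear warm-up over 100 epochs) can be sketched as a small standalone function. This is an illustrative reconstruction, not code from the paper's repository; `base_lr` and `min_lr` are placeholder values, since the paper excerpt does not state the actual learning rate.

```python
import math

def lr_at_epoch(epoch, total_epochs=100, warmup_epochs=10,
                base_lr=1e-3, min_lr=0.0):
    """Cosine-decay learning-rate schedule with linear warm-up.

    base_lr and min_lr are illustrative placeholders; the paper
    excerpt does not report the actual learning-rate values.
    """
    if epoch < warmup_epochs:
        # Linear warm-up: ramp from base_lr/warmup_epochs up to base_lr.
        return base_lr * (epoch + 1) / warmup_epochs
    # Cosine decay from base_lr down to min_lr over the remaining epochs.
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

At epoch 10 the schedule reaches `base_lr` exactly, then decays smoothly toward `min_lr` by the final epoch.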