Exploring Efficient Few-shot Adaptation for Vision Transformers

Authors: Chengming Xu, Siqian Yang, Yabiao Wang, Zhanxiong Wang, Yanwei Fu, Xiangyang Xue

TMLR 2022

Reproducibility Variable Result LLM Response
Research Type Experimental We conduct extensive experiments to show the efficacy of our model. Our method receives outstanding performance on the challenging Meta-Dataset. We conduct extensive experiments to evaluate our eTT: we use the ViT-tiny and ViT-small backbones on the large-scale Meta-Dataset (Triantafillou et al., 2019) consisting of ten sub-datasets from different domains; and the results show that our model can achieve outstanding performance with comparable or even much fewer model parameters.
Researcher Affiliation Collaboration Chengming Xu (School of Data Science, Fudan University); Siqian Yang (Youtu Lab, Tencent); Yabiao Wang (Youtu Lab, Tencent); Zhanxiong Wang (Tencent); Yanwei Fu (School of Data Science, Fudan University); Xiangyang Xue (School of Data Science, Fudan University)
Pseudocode No The paper describes its methodology in natural language and mathematical formulations within the 'Methodology' section (Section 3) and its subsections, but it does not include any distinct pseudocode or algorithm blocks.
Open Source Code Yes Our code is available at https://github.com/loadder/eTT_TMLR2022.
Open Datasets Yes We use Meta-Dataset (Triantafillou et al., 2019), the most comprehensive and challenging large-scale FSL benchmark. It has 10 sub-datasets such as ImageNet (Deng et al., 2009) and Omniglot (Lake et al., 2015), with various domain gaps.
Dataset Splits Yes In each episode T, N is first uniformly sampled from [5, N_max], where N_max equals min(50, |C_t|) or min(50, |C_s|) at the training or testing stage, accordingly. N is supposed to be accessible knowledge during both training and testing; in the most naive case, one can get N by directly counting the number of support classes. From each sampled category, M query samples are randomly selected, constructing the query set Q = {(I_i^q, y_i^q)}_{i=1}^{N_Q}. After that, a random number of samples is taken from the remaining samples of these categories to form the support set S = {(I_i^supp, y_i^supp)}_{i=1}^{N_S}. Note that compared to the classical N-way K-shot setting, this setting generates class-imbalanced support sets, and different episodes contain different numbers of support samples.
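The varying-way, varying-shot episode sampling described above can be sketched in plain Python. This is a hedged illustration, not the actual Meta-Dataset pipeline: the function name `sample_episode`, the fixed per-class query count, and the uniform shot sampling are simplifying assumptions (the real protocol adds per-dataset rules and class-proportional shot budgets).

```python
import random

def sample_episode(class_to_images, n_max=50, m_query=10, seed=0):
    """Sample one varying-way, varying-shot episode (simplified sketch).

    class_to_images maps a class label to its list of image identifiers.
    """
    rng = random.Random(seed)
    classes = list(class_to_images)
    # N ~ Uniform[5, min(50, |C|)], as in the episode description above.
    n_way = rng.randint(5, min(n_max, len(classes)))
    episode_classes = rng.sample(classes, n_way)

    support, query = [], []
    for label in episode_classes:
        # Shuffle this class's images without replacement.
        images = rng.sample(class_to_images[label], len(class_to_images[label]))
        # A fixed number of query samples per class.
        query += [(img, label) for img in images[:m_query]]
        # A random, class-imbalanced number of support samples from the rest.
        k_shot = rng.randint(1, max(1, len(images) - m_query))
        support += [(img, label) for img in images[m_query:m_query + k_shot]]
    return support, query
```

Because the shot count is drawn independently per class, the resulting support set is class-imbalanced and its total size varies from episode to episode, matching the description of the benchmark.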
Hardware Specification No The paper mentions comparing model sizes and FLOPs for different backbones (Res18, ViT-tiny, Res34, ViT-small) in Table 1, but it does not specify any particular hardware (e.g., GPU, CPU models) used to perform the experiments or training.
Software Dependencies No The paper mentions using the AdamW optimizer and DINO for pretraining. It also cites 'fvcore' for FLOPs calculation in a footnote. However, it does not provide specific version numbers for any software libraries, frameworks (like PyTorch or TensorFlow), or programming languages used for implementation.
Experiment Setup Yes We set the patch size as 8 for ViT-tiny (as it has a small input image size), and keep the other hyper-parameters as default. We adopt a standard ViT-small with 12 layers, 6 attention heads, feature dimension 384 and patch size 16. We strictly follow the hyper-parameter setting and data augmentation in DINO (Caron et al., 2021) for pretraining. In test-time finetuning, we empirically set the hidden dimension of the transformation module as d/2, and the output dimension d_proj of the projector as 64 for all datasets. We utilize the AdamW optimizer for finetuning, with the learning rate set as 1e-3 for Traffic Sign and 5e-4 for the other datasets. λ is set as 0.1. For simplicity, the selection of hyper-parameters is conducted on the meta-validation set of ImageNet, which is the only within-domain setting in Meta-Dataset.
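The reported test-time finetuning hyper-parameters can be collected into a small config sketch. This is an illustrative summary only: the names `FEATURE_DIM`, `FINETUNE_CONFIG`, and `learning_rate` are assumptions and do not appear in the paper or its code; the values are those quoted above for the ViT-small backbone.

```python
FEATURE_DIM = 384  # ViT-small embedding dimension d (12 layers, 6 heads)

# Hedged summary of the finetuning setup quoted above.
FINETUNE_CONFIG = {
    "optimizer": "AdamW",
    "hidden_dim": FEATURE_DIM // 2,  # transformation module hidden dim, d/2
    "proj_dim": 64,                  # projector output dimension d_proj
    "lambda": 0.1,                   # loss-term weight λ
    "lr": {
        "traffic_sign": 1e-3,        # Traffic Sign uses a larger rate
        "default": 5e-4,             # all other Meta-Dataset domains
    },
}

def learning_rate(dataset: str) -> float:
    """Return the per-dataset finetuning learning rate."""
    lrs = FINETUNE_CONFIG["lr"]
    return lrs.get(dataset, lrs["default"])
```

Keeping the per-dataset learning rates in one table makes the Traffic Sign exception explicit, which is the only dataset-specific deviation the setup reports.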