Separable Self-attention for Mobile Vision Transformers

Authors: Sachin Mehta, Mohammad Rastegari

TMLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results on standard vision datasets and tasks demonstrate the effectiveness of the proposed method (Fig. 2).
Researcher Affiliation | Industry | Sachin Mehta (Apple Inc.), Mohammad Rastegari (Apple Inc.)
Pseudocode | No | The paper describes mathematical operations (e.g., Eq. 1 and 2) and architectural diagrams (Fig. 3, 4, and 6) but does not contain explicitly labeled 'Pseudocode' or 'Algorithm' blocks with structured steps.
Open Source Code | Yes | Our source code is available at: https://github.com/apple/ml-cvnets.
Open Datasets | Yes | We train MobileViTv2 for 300 epochs... on the ImageNet-1k dataset (Russakovsky et al., 2015)... study its performance on the MS-COCO dataset (Lin et al., 2014)... two standard semantic segmentation datasets, ADE20k (Zhou et al., 2017) and PASCAL VOC 2012 (Everingham et al., 2015).
Dataset Splits | Yes | We train MobileViTv2 for 300 epochs with an effective batch size of 1024 images... on the ImageNet-1k dataset (Russakovsky et al., 2015) with 1.28 million and 50 thousand training and validation images, respectively. ...split it into about 11 million and 522 thousand training and validation images spanning over 10,450 classes, respectively. ...Table 11: Configuration for finetuning MobileViTv2 on downstream tasks. Dataset: MS-COCO ... # Training samples: 117k; # Validation samples: 5k
Hardware Specification | Yes | These results are computed on a single CPU core of a machine with a 2.4 GHz 8-Core Intel Core i9 processor... Here, inference time is measured on an iPhone 12... throughput is measured on NVIDIA V100 GPUs...
Software Dependencies | No | The paper mentions PyTorch and CVNets but does not specify their version numbers or other software dependencies with version details.
Experiment Setup | Yes | We train MobileViTv2 for 300 epochs with an effective batch size of 1024 images (128 images per GPU × 8 GPUs) using AdamW (Loshchilov & Hutter, 2019) on the ImageNet-1k dataset... We linearly increase the learning rate from 10^-6 to 0.002 for the first 20k iterations. After that, the learning rate is decayed using a cosine annealing policy (Loshchilov & Hutter, 2017). Tables 9, 10, and 11 provide extensive details on training configurations for various tasks and datasets.
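The quoted schedule (linear warmup from 10^-6 to a peak of 0.002 over the first 20k iterations, then cosine annealing) can be sketched as a small function. The total iteration count and the final learning rate below are assumptions, not values stated in the paper; 375k is inferred from 300 epochs × 1.28M images / batch size 1024 ≈ 1250 iterations per epoch.

```python
import math

def lr_at_iter(it, warmup_iters=20_000, total_iters=375_000,
               warmup_start=1e-6, peak_lr=0.002, min_lr=0.0):
    """Linear warmup followed by cosine annealing, per the quoted setup.

    total_iters (inferred from epochs/batch size) and min_lr are
    assumptions; the paper's quote specifies only the warmup length
    (20k iterations), warmup start (1e-6), and peak LR (0.002).
    """
    if it < warmup_iters:
        # Linear warmup from warmup_start up to peak_lr.
        return warmup_start + (peak_lr - warmup_start) * it / warmup_iters
    # Cosine annealing from peak_lr down to min_lr over the remaining iterations.
    progress = (it - warmup_iters) / max(1, total_iters - warmup_iters)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

In a PyTorch training loop, the equivalent behavior is usually obtained by chaining a linear-warmup scheduler with `torch.optim.lr_scheduler.CosineAnnealingLR`; the standalone function above just makes the shape of the schedule explicit.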