Separable Self-attention for Mobile Vision Transformers
Authors: Sachin Mehta, Mohammad Rastegari
TMLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results on standard vision datasets and tasks demonstrate the effectiveness of the proposed method (Fig. 2). |
| Researcher Affiliation | Industry | Sachin Mehta Apple Inc. Mohammad Rastegari Apple Inc. |
| Pseudocode | No | The paper describes mathematical operations (e.g., Eq. 1 and 2) and architectural diagrams (Fig. 3, 4, 6) but does not contain explicitly labeled 'Pseudocode' or 'Algorithm' blocks with structured steps. |
| Open Source Code | Yes | Our source code is available at: https://github.com/apple/ml-cvnets. |
| Open Datasets | Yes | We train MobileViTv2 for 300 epochs... on the ImageNet-1k dataset (Russakovsky et al., 2015)... study its performance on MS-COCO dataset (Lin et al., 2014)... two standard semantic segmentation datasets, ADE20k (Zhou et al., 2017) and PASCAL VOC 2012 (Everingham et al., 2015). |
| Dataset Splits | Yes | We train MobileViTv2 for 300 epochs with an effective batch size of 1024 images... on the ImageNet-1k dataset (Russakovsky et al., 2015) with 1.28 million and 50 thousand training and validation images respectively. ...split it into about 11 million and 522 thousand training and validation images spanning over 10,450 classes, respectively. ...Table 11: Configuration for finetuning MobileViTv2 on downstream tasks. Dataset MS-COCO ... # Training samples 117k # Validation samples 5k |
| Hardware Specification | Yes | These results are computed on a single CPU core machine with a 2.4 GHz 8-Core Intel Core i9 processor... Here, inference time is measured on an iPhone 12... throughput is measured on NVIDIA V100 GPUs... |
| Software Dependencies | No | The paper mentions PyTorch and CVNets but does not specify their version numbers or other software dependencies with version details. |
| Experiment Setup | Yes | We train MobileViTv2 for 300 epochs with an effective batch size of 1024 images (128 images per GPU × 8 GPUs) using AdamW of Loshchilov & Hutter (2019) on the ImageNet-1k dataset... We linearly increase the learning rate from 10⁻⁶ to 0.002 for the first 20k iterations. After that, the learning rate is decayed using a cosine annealing policy (Loshchilov & Hutter, 2017). Tables 9, 10, and 11 provide extensive details on training configurations for various tasks and datasets. |
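The learning-rate schedule quoted above (linear warmup from 10⁻⁶ to 0.002 over the first 20k iterations, then cosine annealing) can be sketched as a small helper. This is a minimal illustration, not the authors' CVNets implementation; the total step count, final learning rate, and function name are assumptions.

```python
import math

def lr_at(step, total_steps, warmup_steps=20_000,
          warmup_init=1e-6, peak_lr=0.002, min_lr=0.0):
    """Sketch of the reported schedule: linear warmup from 1e-6 to
    0.002 over the first 20k iterations, then cosine annealing.
    total_steps and min_lr are assumed values, not from the paper."""
    if step < warmup_steps:
        # Linear interpolation from warmup_init up to peak_lr.
        frac = step / warmup_steps
        return warmup_init + frac * (peak_lr - warmup_init)
    # Cosine decay from peak_lr down to min_lr over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

For example, `lr_at(0, 100_000)` returns the warmup floor of 1e-6, `lr_at(20_000, 100_000)` returns the peak of 0.002, and the rate then decays smoothly toward zero by the final step.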