Dual-windowed Vision Transformer with Angular Self-Attention

Authors: Weili Shi, Sheng Li

TMLR 2024

Reproducibility Variable Result LLM Response
Research Type Experimental We evaluate DWAViT on multiple computer vision benchmarks, including image classification on ImageNet-1K, object detection on COCO, and semantic segmentation on ADE20K. Our experimental results also suggest that our model can achieve promising performance on these tasks while maintaining computational cost comparable to that of the baseline models (e.g., Swin Transformer).
Researcher Affiliation Academia Weili Shi (EMAIL), School of Data Science, University of Virginia; Sheng Li (EMAIL), School of Data Science, University of Virginia
Pseudocode No The paper includes a 'Proposition 1' and its proof in the Theoretical Analysis section (3.6), which contain mathematical formulas and logical steps. However, the paper does not present a clearly structured 'Pseudocode' or 'Algorithm' block, figure, or section.
Open Source Code Yes The source code is available at https://github.com/DamoSWL/DWAViT.
Open Datasets Yes We evaluate our proposed DWAViT on ImageNet-1K (Deng et al., 2009) classification, COCO (Lin et al., 2014) object detection, and ADE20K (Zhou et al., 2017) semantic segmentation.
Dataset Splits Yes The COCO dataset has 118K images for training and 5K images for validation.
Hardware Specification Yes All experiments are run on NVIDIA A100 GPUs.
Software Dependencies No The paper mentions 'AdamW' for optimization and the 'MMDetection toolbox' and 'MMSegmentation toolbox' as frameworks. However, specific version numbers for these software dependencies, or for other core libraries such as Python or PyTorch, are not provided.
Experiment Setup Yes The total training schedule is 300 epochs, with the first 20 epochs as warm-up. We adopt the AdamW (Kingma & Ba, 2014) algorithm to optimize the model. The initial learning rate is 1.2e-3 and the weight decay is 0.05. The learning rate is adjusted according to a cosine learning rate schedule. The drop path rate is 0.1 and the input image is resized to 224 x 224. The MLP ratio for all DWAViT variants is set to 4. The number of windows in each stage is (100,49), (49,16), (4,1), (1,1). The temperature in the angular self-attention is 0.1 for DWAViT-T and DWAViT-S, and 0.25 for DWAViT-B, respectively. A linear function is adopted to simplify the computation of the quadratic self-attention, and τ is set to 0.4.
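The reported schedule (300 epochs, 20 warm-up epochs, initial learning rate 1.2e-3, cosine decay) can be sketched as a per-epoch learning-rate function. This is a minimal illustration of a standard linear-warm-up-plus-cosine schedule under the paper's stated hyperparameters, not the authors' actual training code; the function name `cosine_lr` and the linear shape of the warm-up are assumptions.

```python
import math

# Hyperparameters as reported in the paper's experiment setup.
TOTAL_EPOCHS = 300
WARMUP_EPOCHS = 20
BASE_LR = 1.2e-3

def cosine_lr(epoch: int) -> float:
    """Learning rate at a given epoch (0-indexed).

    Assumes linear warm-up over the first 20 epochs, then cosine
    decay from the base rate down to zero at epoch 300.
    """
    if epoch < WARMUP_EPOCHS:
        # Ramp linearly from BASE_LR/20 up to BASE_LR.
        return BASE_LR * (epoch + 1) / WARMUP_EPOCHS
    # Fraction of the post-warm-up schedule completed.
    progress = (epoch - WARMUP_EPOCHS) / (TOTAL_EPOCHS - WARMUP_EPOCHS)
    return 0.5 * BASE_LR * (1.0 + math.cos(math.pi * progress))
```

In a training loop, this value would be written into the optimizer's parameter groups at the start of each epoch (e.g., via `param_group["lr"]` with `torch.optim.AdamW`, using the paper's weight decay of 0.05).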