MSVIT: Improving Spiking Vision Transformer Using Multi-scale Attention Fusion
Authors: Wei Hua, Chenlin Zhou, Jibin Wu, Yansong Chua, Yangyang Shu
IJCAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The experimental results show that MSVIT outperforms existing SNN-based models, positioning itself as a state-of-the-art solution among SNN-transformer architectures. We conduct extensive experiments on mainstream static and neuromorphic datasets, achieving state-of-the-art performance compared to the latest SNN-based models. Finally, based on the aforementioned experimental results, we conduct an ablation study to discuss and analyze MSVIT. |
| Researcher Affiliation | Collaboration | 1. China Nanhu Academy of Electronics and Information Technology, China; 2. University of Chinese Academy of Sciences, China; 3. The Hong Kong Polytechnic University, Hong Kong SAR, China; 4. School of Systems and Computing, The University of New South Wales, Australia |
| Pseudocode | No | The paper describes mathematical formulations and architectural components but does not include explicit pseudocode blocks or algorithms. |
| Open Source Code | Yes | The source code is open-sourced and available at https://github.com/Nanhu-AI-Lab/MSViT. |
| Open Datasets | Yes | For static image classification, we use ImageNet-1K [Deng et al., 2009] and CIFAR10/100 [Krizhevsky et al., 2009]. For neuromorphic classification, we employ the CIFAR10-DVS [Li et al., 2017] and DVS128 Gesture [Amir et al., 2017] datasets. |
| Dataset Splits | Yes | We evaluate MSVIT on both static image classification and neuromorphic classification tasks. For static image classification, we use ImageNet-1K [Deng et al., 2009] and CIFAR10/100 [Krizhevsky et al., 2009]. For neuromorphic classification, we employ the CIFAR10-DVS [Li et al., 2017] and DVS128 Gesture [Amir et al., 2017] datasets. The experimental results are on ImageNet-1K. Energy is calculated as the average theoretical power consumption when predicting an image from the ImageNet test set. Appendix D introduces the datasets in detail. |
| Hardware Specification | Yes | First, the model is trained in a distributed manner for 200 epochs on an 8×A100 GPU server. |
| Software Dependencies | No | The paper mentions optimizers such as synchronized AdamW but does not specify software dependencies like libraries or frameworks with version numbers. |
| Experiment Setup | Yes | First, the model is trained in a distributed manner for 200 epochs on an 8×A100 GPU server. We employ several data augmentation techniques, including RandAugment [Cubuk et al., 2020], random erasing [Zhong et al., 2020], and stochastic depth [Huang et al., 2016], with a batch size of 512. Additionally, gradient accumulation is utilized to stabilize training, as suggested in [He et al., 2022]. Second, the optimization process leverages synchronized AdamW with a base learning rate of 6×10⁻⁴ per batch size of 512. The learning rate is linearly warmed up at the initial stage and subsequently decays following a half-period cosine schedule. The effective runtime learning rate is scaled proportionally to the batch size, calculated as BatchSize/256 multiplied by the base learning rate. For CIFAR10, MSVIT achieves an accuracy of 96.53%, outperforming Spikformer by 1.02% and QKFormer [Zhou et al., 2024a] by 0.35%. For CIFAR100, MSVIT achieves an accuracy of 81.98%, exceeding Spikformer (78.21%) by 3.77% and QKFormer by 0.86%. For this experiment, we implement a lightweight version of MSVIT with only 1.67M parameters, utilizing a block configuration of {0, 1, 1} across the three stages. The maximum patch embedding dimension is set to 256. The model is trained for 200 epochs on the DVS128-Gesture dataset and 106 epochs on the CIFAR10-DVS dataset. The number of time steps for the spiking neurons is set to either 10 or 16. |
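The learning-rate recipe quoted above (base LR scaled by BatchSize/256, linear warmup, then half-period cosine decay) can be sketched as a small schedule function. This is a minimal illustration, not the authors' code: the warmup length is not stated in the excerpt, so `warmup_epochs=10` is a placeholder assumption.

```python
import math

def lr_at_epoch(epoch, total_epochs=200, warmup_epochs=10,
                base_lr=6e-4, batch_size=512):
    """Learning rate per the described recipe: the effective peak LR
    is base_lr * batch_size / 256, reached after a linear warmup and
    followed by half-period cosine decay. warmup_epochs is assumed."""
    # Effective runtime LR scales with batch size (BatchSize/256 * base LR).
    peak_lr = base_lr * batch_size / 256
    if epoch < warmup_epochs:
        # Linear warmup from near zero up to the peak learning rate.
        return peak_lr * (epoch + 1) / warmup_epochs
    # Half-period cosine decay over the remaining epochs (peak -> ~0).
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

With batch size 512 the peak learning rate works out to 6e-4 × 512/256 = 1.2e-3, matching the "per batch size of 512" phrasing in the setup.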