MSVIT: Improving Spiking Vision Transformer Using Multi-scale Attention Fusion
Authors: Wei Hua, Chenlin Zhou, Jibin Wu, Yansong Chua, Yangyang Shu
IJCAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The experimental results show that MSVIT outperforms existing SNN-based models, positioning itself as a state-of-the-art solution among SNN-transformer architectures. We conduct extensive experiments on mainstream static and neuromorphic datasets, achieving state-of-the-art performance compared to the latest SNN-based models. Finally, based on the aforementioned experimental results, we conduct an ablation study to discuss and analyze MSVIT. |
| Researcher Affiliation | Collaboration | 1. China Nanhu Academy of Electronics and Information Technology, China; 2. University of Chinese Academy of Sciences, China; 3. The Hong Kong Polytechnic University, Hong Kong SAR, China; 4. School of Systems and Computing, The University of New South Wales, Australia |
| Pseudocode | No | The paper describes mathematical formulations and architectural components but does not include explicit pseudocode blocks or algorithms. |
| Open Source Code | Yes | The source code is open-sourced and available at https://github.com/Nanhu-AI-Lab/MSViT. |
| Open Datasets | Yes | For static image classification, we use ImageNet-1K [Deng et al., 2009] and CIFAR10/100 [Krizhevsky et al., 2009]. For neuromorphic classification, we employ the CIFAR10-DVS [Li et al., 2017] and DVS128 Gesture [Amir et al., 2017] datasets. |
| Dataset Splits | Yes | We evaluate MSVIT on both static image classification and neuromorphic classification tasks. For static image classification, we use ImageNet-1K [Deng et al., 2009] and CIFAR10/100 [Krizhevsky et al., 2009]. For neuromorphic classification, we employ the CIFAR10-DVS [Li et al., 2017] and DVS128 Gesture [Amir et al., 2017] datasets. The experimental results are on ImageNet-1K. Energy is calculated as the average theoretical power consumption when predicting an image from the ImageNet test set. Appendix D introduces the datasets in detail. |
| Hardware Specification | Yes | First, the model is trained in a distributed manner for 200 epochs on an 8×A100 GPU server. |
| Software Dependencies | No | The paper mentions optimizers such as synchronized AdamW but does not specify software dependencies like libraries or frameworks with version numbers. |
| Experiment Setup | Yes | First, the model is trained in a distributed manner for 200 epochs on an 8×A100 GPU server. We employ several data augmentation techniques, including RandAugment [Cubuk et al., 2020], random erasing [Zhong et al., 2020], and stochastic depth [Huang et al., 2016], with a batch size of 512. Additionally, gradient accumulation is utilized to stabilize training, as suggested in [He et al., 2022]. Second, the optimization process leverages synchronized AdamW with a base learning rate of 6×10⁻⁴ per batch size of 512. The learning rate is linearly warmed up at the initial stage and subsequently decays following a half-period cosine schedule. The effective runtime learning rate is scaled proportionally to the batch size, calculated as BatchSize/256 multiplied by the base learning rate. For CIFAR10, MSVIT achieves an accuracy of 96.53%, outperforming Spikformer by 1.02% and QKFormer [Zhou et al., 2024a] by 0.35%. For CIFAR100, MSVIT achieves an accuracy of 81.98%, exceeding Spikformer (78.21%) by 3.77% and QKFormer by 0.86%. For this experiment, we implement a lightweight version of MSVIT with only 1.67M parameters, utilizing a block configuration of {0, 1, 1} across the three stages. The maximum patch embedding dimension is set to 256. The model is trained for 200 epochs on the DVS128-Gesture dataset and 106 epochs on the CIFAR10-DVS dataset. The number of time steps for the spiking neurons is set to either 10 or 16. |
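The learning-rate recipe quoted above (base LR scaled by BatchSize/256, linear warmup, then half-period cosine decay) can be sketched as a small schedule function. This is a minimal illustration, not the authors' code: the warmup length is not stated in the excerpt, so `warmup_epochs=10` is a placeholder assumption.

```python
import math

def lr_at_epoch(epoch, total_epochs=200, warmup_epochs=10,
                base_lr=6e-4, batch_size=512):
    """Learning rate per the described recipe: the effective peak LR
    is base_lr * batch_size / 256, reached after a linear warmup and
    followed by half-period cosine decay. warmup_epochs is assumed."""
    # Effective runtime LR scales with batch size (BatchSize/256 * base LR).
    peak_lr = base_lr * batch_size / 256
    if epoch < warmup_epochs:
        # Linear warmup from near zero up to the peak learning rate.
        return peak_lr * (epoch + 1) / warmup_epochs
    # Half-period cosine decay over the remaining epochs (peak -> ~0).
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

With batch size 512 the peak learning rate works out to 6e-4 × 512/256 = 1.2e-3, matching the "per batch size of 512" phrasing in the setup.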