Quantized Spike-driven Transformer
Authors: Xuerui Qiu, Malu Zhang, Jieyuan Zhang, Wenjie Wei, Honglin Cao, Junsheng Guo, Rui-Jie Zhu, Yimeng Shan, Yang Yang, Haizhou Li
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We validate the QSD-Transformer on various visual tasks, and experimental results indicate that our method achieves state-of-the-art results in the SNN domain. For instance, when compared to the prior SNN benchmark on ImageNet, the QSD-Transformer achieves 80.3% top-1 accuracy, accompanied by significant reductions of 6.0× and 8.1× in power consumption and model size, respectively. |
| Researcher Affiliation | Academia | 1University of Electronic Science and Technology of China, 2Institute of Automation, Chinese Academy of Sciences, 3School of Future Technology, University of Chinese Academy of Sciences, 4China Agricultural University, 5University of California, Santa Cruz, 6Liaoning Technical University, 7Chinese University of Hong Kong (Shenzhen) |
| Pseudocode | No | The paper describes methods using mathematical equations and prose but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks or figures. |
| Open Source Code | No | Our codes and models will be made available on GitHub after review. |
| Open Datasets | Yes | ImageNet-1K dataset (Deng et al., 2009). The comparison results are summarized in Table 1. Notably, with only 6.8M parameters, the QSD-Transformer achieves a top-1 accuracy of 80.3% in the SNN domain, showcasing significant advantages in both accuracy and efficiency. Specifically, QSD-Transformer vs. SD-Transformer v2 (Yao et al., 2023a) vs. SpikingResformer (Shi et al., 2024)... We evaluate the efficacy of the QSD-Transformer on the object detection task and select the classic and large-scale COCO (Lin et al., 2014) dataset as our benchmark for evaluation. ...We validate the efficacy of the QSD-Transformer on the semantic segmentation task and select the challenging ADE20K (Zhou et al., 2017) dataset. ...We performed transfer learning experiments on the static image classification datasets CIFAR10/100 and the neuromorphic classification dataset CIFAR10-DVS. The CIFAR10/100 datasets each have 50,000 training and 10,000 test images with a resolution of 32×32. CIFAR10-DVS consists of 10K event streams created by capturing CIFAR10 images using a DVS camera. |
| Dataset Splits | Yes | ImageNet-1K dataset (Deng et al., 2009)... contains around 1.3 million training images and 50,000 validation images. The COCO dataset comprises 118K training images and 5K validation images. The ADE20K semantic segmentation dataset comprises over 20K training and 2K validation scene-centric images... The CIFAR10/100 datasets each have 50,000 training and 10,000 test images with a resolution of 32×32. CIFAR10-DVS consists of 10K event streams created by capturing CIFAR10 images using a DVS camera. |
| Hardware Specification | Yes | We conducted training on eight 40GB A100 GPUs. For the three different model scales (1.8M, 3.8M, and 6.8M parameters), we allocated 24, 28, and 36 hours of training time, respectively. ... The training process was executed on four 40GB A100 GPUs and lasted for 25 hours. ... The experiments were run on a single 32GB V100 GPU, taking 12 hours for CIFAR-10 and CIFAR-100, and 10 hours for CIFAR-10-DVS. |
| Software Dependencies | No | The paper mentions using mmdetection and mmsegmentation codebases but does not provide specific version numbers for these or any other software dependencies. |
| Experiment Setup | Yes | Table 8: Hyper-parameters for image classification on ImageNet-1K and CIFAR10/100 (listed as ImageNet / CIFAR10/100): Timestep (Training/Inference) 1/4 / 1/4; Epochs 300 / 100; Resolution 224×224 / 128×128; Batch size 1568 / 256; Optimizer LAMB / LAMB; Base learning rate 6e-4 / 6e-4; Learning rate decay Cosine / Cosine; Warmup epochs 10 / None; Weight decay 0.05 / 0.05; RandAugment 9/0.5 / 9/0.5; Mixup None / 0.8; Cutmix None / 1.0; Label smoothing 0.1 / None. ... During the fine-tuning stage... batch size was set to 12. We used the AdamW optimizer with an initial learning rate of 1e-4, and the learning rate was decayed polynomially with a power of 0.9. ...the model was trained on the ADE20K dataset with a batch size of 20 for 160K iterations. We utilized the AdamW optimizer with an initial learning rate of 1×10⁻⁴, and the learning rate was decayed polynomially with a power of 0.9. During the initial 1500 iterations, we employed linear decay to warm up the model. ...we applied data augmentations like mixup, cutmix, and label smoothing. We used a batch size of 128, the AdamW optimizer with a weight decay of 0.01, and a cosine-decay learning rate schedule starting at 1×10⁻⁴ over 100 epochs. |
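The ImageNet-1K hyper-parameters quoted from Table 8 can be collected into a single config for reference. This is a minimal sketch: the dictionary name and key names are our own choices for illustration, not identifiers from the paper's (unreleased) code.

```python
# Hypothetical summary of the ImageNet-1K training recipe reported in
# Table 8 of the paper; key names are illustrative, not from the authors.
imagenet_hparams = {
    "timestep_train": 1,        # training timestep (Table 8: 1/4)
    "timestep_infer": 4,        # inference timestep
    "epochs": 300,
    "resolution": 224,          # 224x224 input images
    "batch_size": 1568,
    "optimizer": "LAMB",
    "base_lr": 6e-4,
    "lr_schedule": "cosine",
    "warmup_epochs": 10,
    "weight_decay": 0.05,
    "rand_augment": "9/0.5",    # magnitude / probability
    "mixup": None,              # not used on ImageNet
    "cutmix": None,             # not used on ImageNet
    "label_smoothing": 0.1,
}

# Example sanity check before launching a run.
assert imagenet_hparams["base_lr"] > 0 and imagenet_hparams["epochs"] > 0
print(imagenet_hparams["optimizer"], imagenet_hparams["base_lr"])
```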