EfficientVMamba: Atrous Selective Scan for Light Weight Visual Mamba
Authors: Xiaohuan Pei, Tao Huang, Chang Xu
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results show that EfficientVMamba scales down the computational complexity while yielding competitive results across a variety of vision tasks. For example, EfficientVMamba-S with 1.3G FLOPs improves on Vim-Ti with 1.5G FLOPs by a large margin of 5.6% accuracy on ImageNet. |
| Researcher Affiliation | Academia | School of Computer Science, Faculty of Engineering, The University of Sydney, Australia |
| Pseudocode | No | The paper describes methodologies using text and mathematical equations, but it does not contain a clearly labeled pseudocode block or algorithm section. |
| Open Source Code | Yes | Code https://github.com/TerryPei/EfficientVMamba |
| Open Datasets | Yes | We train our models for 300 epochs with a base batch size of 1024 and an AdamW optimizer; a cosine annealing learning rate schedule is adopted with initial value 10⁻³ and a 20-epoch warmup. For training data augmentation, we use random cropping, AutoAugment (Cubuk et al. 2019) with policy rand-m9-mstd0.5, and random erasing of pixels with a probability of 0.25 on each image; then a MixUp (Zhang et al. 2017) strategy with ratio 0.2 is adopted in each batch. An exponential moving average of the model is adopted with decay rate 0.9999. Tiny models (<1 GFLOPs). In the pursuit of efficiency, the results of tiny models are shown in Table 2. EfficientVMamba-T achieves state-of-the-art performance with a Top-1 accuracy of 76.5%, rivalling its counterparts that demand higher computational costs. With a modest expenditure of only 0.8 GFLOPs, our model surpasses the PVTv2-B0 by a 6% margin in accuracy and outperforms the MobileViT-XS by 1.7%, all with less computational demand. |
| Dataset Splits | Yes | Following previous works (Touvron et al. 2021a; Liu et al. 2021; Zhu et al. 2024; Liu et al. 2024b), we train our models for 300 epochs with a base batch size of 1024 and an AdamW optimizer; a cosine annealing learning rate schedule is adopted with initial value 10⁻³ and a 20-epoch warmup. For training data augmentation, we use random cropping, AutoAugment (Cubuk et al. 2019) with policy rand-m9-mstd0.5, and random erasing of pixels with a probability of 0.25 on each image; then a MixUp (Zhang et al. 2017) strategy with ratio 0.2 is adopted in each batch. An exponential moving average of the model is adopted with decay rate 0.9999. (...) We evaluate the efficacy of our EfficientVMamba model for object detection tasks on the MSCOCO 2017 (Lin et al. 2014) dataset. Our evaluation framework relies on the mmdetection library (Chen et al. 2019). For comparisons with light-weight backbones, we follow PVT (Wang et al. 2021) to use RetinaNet as the detector and adopt a 1× training schedule. For comparisons with larger backbones, our experiment follows the hyperparameter settings detailed in Swin (Liu et al. 2021). We use the AdamW optimization method to refine the weights of our pre-trained networks on ImageNet-1K for durations of 12 and 36 epochs. We apply drop path rates of 0.2 across the board for EfficientVMamba-T/S/B variants. The learning rate begins at 1e-5 and is decreased tenfold at epochs 9 and 11. Multi-scale training and random flipping are implemented during training with a batch size of 16, adhering to standard procedures for evaluating object detection systems. (...) Aligning with VMamba (Liu et al. 2024b) settings, we integrate an UPerHead into the pretrained model structure. Utilizing the AdamW optimizer, we initiate the learning rate at 6×10⁻⁵. The fine-tuning stage consists of 160k iterations, using a batch size of 16. While the standard input resolution stands at 512×512, we also conduct experiments with 640×640 inputs and apply multiscale (MS) testing to broaden our evaluation. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models or memory used for running the experiments. |
| Software Dependencies | No | The paper mentions software like the 'mmdetection library' and optimizers like 'AdamW', but it does not specify any version numbers for these software components or libraries. |
| Experiment Setup | Yes | Following previous works (Touvron et al. 2021a; Liu et al. 2021; Zhu et al. 2024; Liu et al. 2024b), we train our models for 300 epochs with a base batch size of 1024 and an AdamW optimizer; a cosine annealing learning rate schedule is adopted with initial value 10⁻³ and a 20-epoch warmup. For training data augmentation, we use random cropping, AutoAugment (Cubuk et al. 2019) with policy rand-m9-mstd0.5, and random erasing of pixels with a probability of 0.25 on each image; then a MixUp (Zhang et al. 2017) strategy with ratio 0.2 is adopted in each batch. An exponential moving average of the model is adopted with decay rate 0.9999. (...) We use the AdamW optimization method to refine the weights of our pre-trained networks on ImageNet-1K for durations of 12 and 36 epochs. We apply drop path rates of 0.2 across the board for EfficientVMamba-T/S/B variants. The learning rate begins at 1e-5 and is decreased tenfold at epochs 9 and 11. Multi-scale training and random flipping are implemented during training with a batch size of 16, adhering to standard procedures for evaluating object detection systems. (...) Utilizing the AdamW optimizer, we initiate the learning rate at 6×10⁻⁵. The fine-tuning stage consists of 160k iterations, using a batch size of 16. While the standard input resolution stands at 512×512, we also conduct experiments with 640×640 inputs and apply multiscale (MS) testing to broaden our evaluation. |
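The quoted ImageNet recipe (AdamW, 300 epochs, cosine annealing from 10⁻³, 20-epoch warmup) can be sketched as a per-epoch learning-rate function. This is a minimal illustration, not the authors' code: the linear warmup shape and a floor of zero are assumptions, as the paper does not state them.

```python
import math

def imagenet_lr(epoch: int, base_lr: float = 1e-3,
                warmup_epochs: int = 20, total_epochs: int = 300) -> float:
    """Cosine-annealing LR with linear warmup, matching the paper's stated
    hyperparameters (base LR 1e-3, 20-epoch warmup, 300 epochs).
    Warmup shape and zero floor are assumptions, not from the paper."""
    if epoch < warmup_epochs:
        # Linear warmup from base_lr/warmup_epochs up to base_lr.
        return base_lr * (epoch + 1) / warmup_epochs
    # Cosine decay from base_lr toward zero over the remaining epochs.
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```

In practice this would be driven by a framework scheduler (e.g. a cosine scheduler stepping an AdamW optimizer each epoch); the function above only makes the schedule's shape explicit.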
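The detection fine-tuning schedule (LR starting at 1e-5, decreased tenfold at epochs 9 and 11 under a 12-epoch 1× schedule) is a standard multi-step decay. A hedged sketch, with the milestone semantics (decay applied from the milestone epoch onward) as an assumption:

```python
def detection_lr(epoch: int, base_lr: float = 1e-5,
                 milestones: tuple = (9, 11), gamma: float = 0.1) -> float:
    """Multi-step LR decay per the paper's detection setup: base LR 1e-5,
    divided by 10 at epochs 9 and 11. Whether the drop applies at the start
    of the milestone epoch is an assumption."""
    lr = base_lr
    for m in milestones:
        if epoch >= m:
            lr *= gamma  # tenfold decrease at each milestone
    return lr
```

This mirrors what mmdetection's standard 1× schedule configures via step milestones [8, 11] or [9, 11] depending on the config; the exact milestone convention here follows the paper's wording.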