Brain-Inspired Stepwise Patch Merging for Vision Transformers

Authors: Yonghao Yu, Dongcheng Zhao, Guobin Shen, Yiting Dong, Yi Zeng

IJCAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments conducted on benchmark datasets, including ImageNet-1K, COCO, and ADE20K, demonstrate that SPM significantly improves the performance of various models, particularly in dense prediction tasks such as object detection and semantic segmentation. Experiments also show that combining SPM with different backbones further improves performance.
Researcher Affiliation | Academia | 1. School of Artificial Intelligence, University of Chinese Academy of Sciences; 2. Brain-inspired Cognitive AI Lab, Institute of Automation, Chinese Academy of Sciences; 3. State Key Laboratory of Brain Cognition and Brain-inspired Intelligence Technology; 4. School of Future Technology, University of Chinese Academy of Sciences; 5. Center for Long-term AI
Pseudocode | No | The paper provides mathematical formulations (Eqs. 1-5) and describes the methodology in detail, but it does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | The code has been released at https://github.com/Yonghao-Yu/StepwisePatchMerging.
Open Datasets | Yes | Extensive experiments conducted on benchmark datasets, including ImageNet-1K, COCO, and ADE20K, demonstrate that SPM significantly improves the performance of various models, particularly in dense prediction tasks such as object detection and semantic segmentation.
Dataset Splits | Yes | Setting: the proposed SPM framework is first evaluated on the ImageNet-1K dataset [Deng et al., 2009], which comprises 1.28 million training images and 50,000 validation images spanning 1,000 categories. On COCO, all models were trained on the training set of 118k images and evaluated on the validation set of 5k images. The ADE20K dataset [Zhou et al., 2017] is a widely used benchmark for semantic segmentation, comprising 150 categories, with 20,210 images for training, 2,000 for validation, and 3,352 for testing.
Hardware Specification | Yes | All models are trained from scratch for 300 epochs on eight NVIDIA A100 GPUs (ImageNet-1K). For object detection, models were trained with a batch size of 16 on eight NVIDIA A100 GPUs and optimized using the AdamW optimizer [Loshchilov and Hutter, 2017] with an initial learning rate of 1e-4. For semantic segmentation, models were trained for 40,000 iterations with a batch size of 16 on eight NVIDIA A100 GPUs.
Software Dependencies | No | The paper mentions software components such as the AdamW optimizer, Mask R-CNN, and Semantic FPN, but provides no version numbers for these, nor for the programming language or core libraries such as Python, PyTorch, or TensorFlow.
Experiment Setup | Yes | For data augmentation, a suite of techniques is applied, including random cropping, random horizontal flipping [Szegedy et al., 2015], label smoothing regularization [Szegedy et al., 2016], mixup [Zhang et al., 2017], CutMix [Yun et al., 2019], and random erasing [Zhong et al., 2020]. ImageNet-1K training uses the AdamW optimizer [Loshchilov and Hutter, 2017] with a momentum parameter of 0.9, a mini-batch size of 128, and a weight decay of 5e-2; the initial learning rate of 1e-3 follows a cosine annealing schedule [Loshchilov and Hutter, 2016], and all models are trained from scratch for 300 epochs. For dense prediction, models were trained with a batch size of 16 on eight NVIDIA A100 GPUs and optimized using AdamW with an initial learning rate of 1e-4; segmentation models were trained for 40,000 iterations, with the learning rate following a polynomial decay schedule with a power of 0.9.
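The two learning-rate schedules quoted in the setup (cosine annealing from 1e-3 over 300 epochs for ImageNet-1K, and polynomial decay with power 0.9 from 1e-4 over 40,000 iterations for segmentation) can be sketched as plain functions. This is an illustration of the standard schedule formulas, not the authors' released code; the function names and the min_lr default are assumptions.

```python
import math

def cosine_annealing_lr(step: int, total_steps: int,
                        base_lr: float = 1e-3, min_lr: float = 0.0) -> float:
    """Cosine annealing from base_lr down to min_lr (Loshchilov & Hutter, 2016)."""
    t = min(step, total_steps) / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * t))

def polynomial_decay_lr(step: int, total_steps: int,
                        base_lr: float = 1e-4, power: float = 0.9) -> float:
    """'Poly' decay, as commonly used for semantic segmentation schedules."""
    t = min(step, total_steps) / total_steps
    return base_lr * (1.0 - t) ** power

# 300-epoch ImageNet-1K schedule: starts at the base rate of 1e-3.
print(cosine_annealing_lr(0, 300))        # 0.001
# 40k-iteration segmentation schedule: rate halfway through training.
print(polynomial_decay_lr(20_000, 40_000))
```

Both schedules decay monotonically to zero at the final step; the cosine curve keeps the rate high early and drops sharply mid-training, while the poly curve with power 0.9 is close to linear.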