Autoregressive Pretraining with Mamba in Vision

Authors: Sucheng Ren, Xianhang Li, Haoqin Tu, Feng Wang, Fangxun Shu, Lei Zhang, Jieru Mei, Linjie Yang, Peng Wang, Heng Wang, Alan Yuille, Cihang Xie

ICLR 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive results are provided showing "our proposed ARM achieves substantially stronger performance. As shown in Figure 1, ARM helps our base-size model attain 83.2% ImageNet accuracy, outperforming its supervised counterpart by 2.0%; our huge-size Mamba, the largest Vision Mamba to date, attains 85.0% ImageNet accuracy (85.5% when finetuned with 384x384 inputs), notably surpassing all other Mamba variants in vision." The paper also includes detailed sections on 'EXPERIMENT', 'MAIN RESULTS', 'ROBUSTNESS', 'DOWNSTREAM GENERALIZATION', and 'ABLATION STUDY', all of which present empirical data, metrics, and comparisons.
Researcher Affiliation | Collaboration | The authors are affiliated with: 1Johns Hopkins University, 2UC Santa Cruz, 3Alibaba Group, 4UCSD, 5ByteDance. This includes academic institutions (Johns Hopkins University, UC Santa Cruz, UCSD) and industry affiliations (Alibaba Group, ByteDance), indicating a collaboration.
Pseudocode | No | The paper does not contain any sections or figures explicitly labeled as 'Pseudocode' or 'Algorithm', nor does it present structured steps in a code-like format.
Open Source Code | Yes | The code is available at https://github.com/OliverRensu/ARM.
Open Datasets | Yes | We pretrain ARM using the ImageNet-1K dataset (Deng et al., 2009). In addition to testing on the ImageNet evaluation set, we evaluate model robustness without finetuning on various out-of-domain ImageNet variants, including natural adversarial examples (ImageNet-A (Hendrycks et al., 2021b)), semantic shifts (ImageNet-R (Hendrycks et al., 2021a)), image sketches (ImageNet-S (Wang et al., 2019)), ImageNet-V2 (Recht et al., 2019), and ImageNet-Real (Beyer et al., 2020). Moreover, we finetune the pretrained model on different downstream tasks including object detection and instance segmentation on COCO (Lin et al., 2014), and semantic segmentation on ADE20K (Zhou et al., 2019).
Dataset Splits | Yes | We pretrain ARM using the ImageNet-1K dataset (Deng et al., 2009). We finetune the ARM models on the ImageNet classification task. Moreover, we finetune the pretrained model on different downstream tasks including object detection and instance segmentation on COCO (Lin et al., 2014), and semantic segmentation on ADE20K (Zhou et al., 2019). These are standard benchmark datasets with well-defined, commonly used splits for classification, object detection/segmentation, and semantic segmentation tasks.
Hardware Specification | Yes | For example, under the setting of training the base-size Mamba for 300 epochs, autoregressive training requires only 34 hours (measured on 8 A5000 GPUs).
Software Dependencies | No | The paper mentions using the AdamW optimizer, Mask R-CNN, Swin Transformer's protocol, and the mmsegmentation toolkit, but it does not provide specific version numbers for any of these software components or other libraries/environments.
Experiment Setup | Yes | Pretraining. We pretrain ARM using the ImageNet-1K dataset (Deng et al., 2009). Specifically, ARM-B and ARM-L are pre-trained for 1600 epochs, and ARM-H is pre-trained for 800 epochs. We use a batch size of 2048/1024/512 for ARM-B/L/H, respectively, and a learning rate of lr = 1.5e-4 × batchsize / 256. We adopt a cosine decay schedule with a warm-up for 5 epochs. We adopt the AdamW (Loshchilov & Hutter, 2019) optimizer with a weight decay of 0.05. We use random resized cropping and random horizontal flipping. The pretraining input size is set to 192x192. Finetuning. Following pretraining, we finetune the ARM models on the ImageNet classification task. Specifically, we finetune all models for 100 epochs with a batch size of 1024, with the input size set at 224x224. We use the same data augmentation as MAE (He et al., 2022). We adopt AdamW as an optimizer, and the peak learning rate is lr = 5e-4 × batchsize / 256 with a cosine decay schedule and a warm-up for 5 epochs. Additionally, we employ the exponential moving average (EMA) (Izmailov et al., 2018) for stronger performance.
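The finetuning recipe above combines three standard ingredients: a peak learning rate linearly scaled by batch size (lr = base_lr × batchsize / 256), a 5-epoch linear warm-up, and cosine decay to the end of training. A minimal sketch of that schedule, with `lr_at_epoch` a hypothetical helper name and defaults matching the paper's finetuning values (base lr 5e-4, batch size 1024, 100 epochs):

```python
import math

def lr_at_epoch(epoch, total_epochs=100, warmup_epochs=5,
                base_lr=5e-4, batch_size=1024, min_lr=0.0):
    """Linearly scaled peak LR with linear warm-up then cosine decay.

    Sketch of the schedule described in the setup; the exact per-step
    (vs. per-epoch) granularity and min_lr floor are assumptions.
    """
    # Linear scaling rule: peak lr = base_lr * batch_size / 256
    peak_lr = base_lr * batch_size / 256
    if epoch < warmup_epochs:
        # Linear warm-up from 0 to the peak over the first 5 epochs
        return peak_lr * epoch / warmup_epochs
    # Cosine decay from peak_lr down to min_lr over the remaining epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return min_lr + (peak_lr - min_lr) * 0.5 * (1 + math.cos(math.pi * progress))
```

With these defaults the peak is 5e-4 × 1024 / 256 = 2e-3, reached at epoch 5, decaying back toward zero by epoch 100; the pretraining schedule is the same shape with base lr 1.5e-4 and the per-model batch sizes.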