Autoregressive Pretraining with Mamba in Vision
Authors: Sucheng Ren, Xianhang Li, Haoqin Tu, Feng Wang, Fangxun Shu, Lei Zhang, Jieru Mei, Linjie Yang, Peng Wang, Heng Wang, Alan Yuille, Cihang Xie
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive results are provided showing our proposed ARM achieves substantially stronger performance. As shown in Figure 1, ARM helps our base-size model attain 83.2% ImageNet accuracy, outperforming its supervised counterpart by 2.0%; our huge-size Mamba, the largest Vision Mamba to date, attains 85.0% ImageNet accuracy (85.5% when finetuned with 384x384 inputs), notably surpassing all other Mamba variants in vision. The paper also includes detailed sections on 'EXPERIMENT', 'MAIN RESULTS', 'ROBUSTNESS', 'DOWNSTREAM GENERALIZATION', and 'ABLATION STUDY', all of which present empirical data, metrics, and comparisons. |
| Researcher Affiliation | Collaboration | The authors are affiliated with: 1Johns Hopkins University, 2UC Santa Cruz, 3Alibaba Group, 4UCSD, 5ByteDance. This includes academic institutions (Johns Hopkins University, UC Santa Cruz, UCSD) and industry affiliations (Alibaba Group, ByteDance), indicating a collaboration. |
| Pseudocode | No | The paper does not contain any sections or figures explicitly labeled as 'Pseudocode' or 'Algorithm', nor does it present structured steps in a code-like format. |
| Open Source Code | Yes | The code is available at https://github.com/OliverRensu/ARM. |
| Open Datasets | Yes | We pretrain ARM using the ImageNet-1K dataset (Deng et al., 2009). In addition to testing on the ImageNet evaluation set, we evaluate model robustness without finetuning on various out-of-domain ImageNet variants, including natural adversarial examples (ImageNet-A (Hendrycks et al., 2021b)), semantic shifts (ImageNet-R (Hendrycks et al., 2021a)), image sketches (ImageNet-S (Wang et al., 2019)), ImageNet-V2 (Recht et al., 2019), and ImageNet-Real (Beyer et al., 2020). Moreover, we finetune the pretrained model on different downstream tasks including object detection and instance segmentation on COCO (Lin et al., 2014), and semantic segmentation on ADE20K (Zhou et al., 2019). |
| Dataset Splits | Yes | We pretrain ARM using the ImageNet-1K dataset (Deng et al., 2009). We finetune the ARM models on the ImageNet classification task. Moreover, we finetune the pretrained model on different downstream tasks including object detection and instance segmentation on COCO (Lin et al., 2014), and semantic segmentation on ADE20K (Zhou et al., 2019). These are standard benchmark datasets with well-defined, commonly used splits for classification, object detection/segmentation, and semantic segmentation tasks. |
| Hardware Specification | Yes | For example, under the setting of training the base-size Mamba for 300 epochs, autoregressive training requires only 34 hours (measured on 8 A5000 GPUs). |
| Software Dependencies | No | The paper mentions using the AdamW optimizer, Mask R-CNN, Swin Transformer's protocol, and the mmsegmentation toolkit, but it does not provide specific version numbers for any of these software components or other libraries/environments. |
| Experiment Setup | Yes | Pretraining. We pretrain ARM using the ImageNet-1K dataset (Deng et al., 2009). Specifically, ARM-B and ARM-L are pre-trained for 1600 epochs, and ARM-H is pre-trained for 800 epochs. We use a batch size of 2048/1024/512 for ARM-B/L/H, respectively, and a learning rate of lr = 1.5e-4 × batchsize / 256. We adopt a cosine decay schedule with a warm-up for 5 epochs. We adopt the AdamW (Loshchilov & Hutter, 2019) optimizer with a weight decay of 0.05. We use random resized cropping and random horizontal flipping. The pretraining input size is set to 192x192. Finetuning. Following pretraining, we finetune the ARM models on the ImageNet classification task. Specifically, we finetune all models for 100 epochs with a batch size of 1024, with the input size set at 224x224. We use the same data augmentation as MAE (He et al., 2022). We adopt AdamW as an optimizer, and the peak learning rate is lr = 5e-4 × batchsize / 256 with a cosine decay schedule and a warm-up for 5 epochs. Additionally, we employ the exponential moving average (EMA) (Izmailov et al., 2018) for stronger performance. |
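The pretraining recipe above combines a batch-size-scaled peak learning rate (lr = base_lr × batchsize / 256) with a 5-epoch warm-up and cosine decay. A minimal sketch of that schedule, assuming linear warm-up and a minimum learning rate of zero (the paper specifies the scaling rule and epoch counts but not these details; function and parameter names here are illustrative, not from the ARM codebase):

```python
import math

def lr_at_epoch(epoch, base_lr=1.5e-4, batch_size=2048,
                warmup_epochs=5, total_epochs=1600, min_lr=0.0):
    """Learning rate at a given epoch: linear warm-up, then cosine decay.

    Peak LR follows the batch-size scaling rule from the paper's
    pretraining setup (ARM-B: base_lr=1.5e-4, batch size 2048,
    1600 epochs). Warm-up shape and min_lr are assumptions.
    """
    peak_lr = base_lr * batch_size / 256
    if epoch < warmup_epochs:
        # Linear ramp from 0 up to the peak over the warm-up epochs.
        return peak_lr * epoch / warmup_epochs
    # Cosine decay from peak_lr down to min_lr over the remaining epochs.
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return min_lr + (peak_lr - min_lr) * 0.5 * (1.0 + math.cos(math.pi * progress))
```

With the ARM-B settings this gives a peak of 1.5e-4 × 2048 / 256 = 1.2e-3 at the end of warm-up, decaying toward zero by epoch 1600; the finetuning schedule would use base_lr=5e-4, batch_size=1024, total_epochs=100 under the same rule.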