Rethinking Patch Dependence for Masked Autoencoders

Authors: Letian Fu, Long Lian, Renhao Wang, Baifeng Shi, XuDong Wang, Adam Yala, Trevor Darrell, Alexei A. Efros, Ken Goldberg

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type Experimental Our findings reveal that MAE reconstructs coherent images from visible patches not through interactions between patches in the decoder but by learning a global representation within the encoder. This discovery leads us to propose a simple visual pretraining framework: cross-attention masked autoencoders (CrossMAE). This framework employs only cross-attention in the decoder to independently read out reconstructions for a small subset of masked patches from encoder outputs. This approach achieves comparable or superior performance to traditional MAE across models ranging from ViT-S to ViT-H and significantly reduces computational requirements. By its design, CrossMAE challenges the necessity of interaction between mask tokens for effective masked pretraining. Code and models are publicly available: https://crossmae.github.io/.
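The cross-attention-only decoder described above can be sketched as follows. This is an illustrative approximation, not the authors' implementation: layer names, the pre-norm layout, and dimensions are assumptions. The key property it demonstrates is that mask-token queries attend only to encoder outputs, with no self-attention among mask tokens.

```python
import torch
import torch.nn as nn

class CrossAttentionDecoderBlock(nn.Module):
    """Minimal sketch of a decoder block that uses only cross-attention:
    mask-token queries read from encoder outputs; there is no
    self-attention among the mask tokens themselves."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.LayerNorm(dim),
            nn.Linear(dim, 4 * dim),
            nn.GELU(),
            nn.Linear(4 * dim, dim),
        )

    def forward(self, queries: torch.Tensor, enc_out: torch.Tensor) -> torch.Tensor:
        # queries: [B, M, C] mask tokens; enc_out: [B, N, C] visible-patch features
        q = self.norm_q(queries)
        kv = self.norm_kv(enc_out)
        attn_out, _ = self.cross_attn(q, kv, kv)  # queries attend to encoder output
        queries = queries + attn_out              # residual over cross-attention
        queries = queries + self.mlp(queries)     # residual over MLP
        return queries
```

Because each mask token is decoded independently of the others, the decoder can read out reconstructions for only a subset of masked patches, which is where the computational savings come from.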
Researcher Affiliation Academia Letian Fu EMAIL UC Berkeley Long Lian EMAIL UC Berkeley Renhao Wang EMAIL UC Berkeley Baifeng Shi EMAIL UC Berkeley Xudong Wang EMAIL UC Berkeley Adam Yala EMAIL UC Berkeley, UCSF Trevor Darrell EMAIL UC Berkeley Alexei A. Efros EMAIL UC Berkeley Ken Goldberg EMAIL UC Berkeley
Pseudocode Yes The pseudo-code of inter-block attention is the following: class InterBlockAttention():
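The excerpt above truncates the pseudo-code. A runnable sketch of what such an inter-block attention module might look like is given below; it assumes the module fuses features from multiple encoder blocks with learned per-block weights, which is an interpretation of the name rather than the authors' exact code.

```python
import torch
import torch.nn as nn

class InterBlockAttention(nn.Module):
    """Sketch: combine features from several encoder blocks with learned
    weights, so the decoder can read from a mix of encoder depths.
    Shapes and the softmax-weighted sum are illustrative assumptions."""

    def __init__(self, num_enc_blocks: int):
        super().__init__()
        # one learnable scalar weight per encoder block
        self.weights = nn.Parameter(torch.ones(num_enc_blocks) / num_enc_blocks)

    def forward(self, enc_feats: list[torch.Tensor]) -> torch.Tensor:
        # enc_feats: list of [B, N, C] tensors, one per encoder block
        stacked = torch.stack(enc_feats, dim=0)        # [L, B, N, C]
        w = torch.softmax(self.weights, dim=0)         # normalize block weights
        return torch.einsum("l,lbnc->bnc", w, stacked) # weighted sum -> [B, N, C]
```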
Open Source Code Yes Code and models are publicly available: https://crossmae.github.io/.
Open Datasets Yes A quantitative comparison over the ImageNet validation set shows that the masked tokens in MAE disproportionately attend to the visible tokens (1.42 vs. 0.39), questioning the necessity of attention within mask tokens. We perform self-supervised pretraining on ImageNet-1K, following MAE's (He et al., 2022) hyperparameter settings, modifying only the learning rate and decoder depth. We additionally evaluate models pretrained with CrossMAE for object detection and instance segmentation, which require deeper spatial understanding than ImageNet classification. Specifically, we follow ViTDet (Li et al., 2022b), a method that leverages a Vision Transformer backbone for object detection and instance segmentation. We report box AP for object detection and mask AP for instance segmentation on the COCO dataset (Lin et al., 2014), following MAE (He et al., 2022). To further investigate the performance of CrossMAE, we provide results on additional downstream tasks, including classification on iNaturalist 2019 and Places365, and semantic segmentation on ADE20K. These experiments demonstrate that CrossMAE performs comparably to MAE for transfer learning while offering improved performance on specific tasks.
Dataset Splits Yes We perform self-supervised pretraining on ImageNet-1K, following MAE's (He et al., 2022) hyperparameter settings, modifying only the learning rate and decoder depth. A quantitative comparison over the ImageNet validation set shows that the masked tokens in MAE disproportionately attend to the visible tokens (1.42 vs. 0.39), questioning the necessity of attention within mask tokens. Setup. The model performance is evaluated with end-to-end fine-tuning, with top-1 accuracy used for comparison. As in Figure 2, we compare two versions of CrossMAE: one with a prediction ratio of 25% (1/3 of the mask tokens) and another with 75% (all mask tokens). Both models are trained with a mask ratio of 75% and a decoder depth of 12.
Hardware Specification Yes Each pretraining and finetuning experiment is run on 2 or 4 NVIDIA A100 80GB GPUs. The batch size per GPU is scaled accordingly, and we use gradient accumulation to avoid out-of-memory errors. ViTDet (Li et al., 2022b) experiments use a single machine equipped with 8 NVIDIA A100 (80GB) GPUs. We copy the datasets to shared memory on the machines to accelerate data loading. We use FlashAttention-2 (Dao, 2023) to accelerate attention computation. In Table 5, we provide a more structured ablation of the different components of CrossMAE and their effect on runtime. The setup mirrors that of Table 3, with runtime measured on 2x A100 80GB GPUs, utilizing FlashAttention-2 (Dao, 2023) across all models and gradient accumulation set to 2 for 400 epochs.
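The gradient-accumulation strategy mentioned above (stepping the optimizer only every few micro-batches to emulate a larger effective batch size) can be sketched as follows. The function name and loop structure are illustrative, not the authors' training code.

```python
import torch

def train_with_accumulation(model, loss_fn, loader, optimizer, accum_steps=2):
    """Sketch of gradient accumulation: scale each micro-batch loss by
    1/accum_steps so gradients summed in .grad average correctly, and
    step the optimizer only every accum_steps micro-batches."""
    optimizer.zero_grad()
    for i, (x, y) in enumerate(loader):
        loss = loss_fn(model(x), y) / accum_steps  # average over micro-batches
        loss.backward()                            # gradients accumulate in .grad
        if (i + 1) % accum_steps == 0:
            optimizer.step()                       # one update per accum_steps batches
            optimizer.zero_grad()
```

With `accum_steps=2` and half the per-GPU batch size, each optimizer step sees the same number of samples as the original large-batch setup, at the cost of extra forward/backward passes.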
Software Dependencies Yes We use FlashAttention-2 (Dao, 2023) to accelerate attention computation.
Config: Value
optimizer: AdamW (Loshchilov & Hutter, 2017b)
base learning rate: 1.5e-4
learning rate schedule: cosine decay (Loshchilov & Hutter, 2017a)
batch size: 4096
weight decay: 0.05
optimizer momentum: β1, β2 = 0.9, 0.95 (Chen et al., 2020a)
warmup epochs (Goyal et al., 2017a): 20, 40
total epochs: 400, 800
augmentation: RandAug (9, 0.5) (Cubuk et al., 2019)
label smoothing (Szegedy et al., 2016): 0.1
mixup (Zhang et al., 2018): 0.8
cutmix (Yun et al., 2019): 1.0
drop path (Huang et al., 2016): 0.1
Experiment Setup Yes We perform self-supervised pretraining on ImageNet-1K, following MAE's (He et al., 2022) hyperparameter settings, modifying only the learning rate and decoder depth. The hyperparameters were initially determined on ViT-Base and then directly applied to ViT-Small, ViT-Large, and ViT-Huge. Both CrossMAE and MAE are trained for 800 epochs. After pre-training, we evaluate the pre-trained models by fine-tuning them for image classification and instance segmentation. We provide implementation details and more experiments on different datasets and downstream tasks (iNaturalist (Van Horn et al., 2018), Places365 (Zhou et al., 2017), and ADE20K (Zhou et al., 2019)) in Appendix B.5.
Table 10: Pretraining Hyperparameters
optimizer: AdamW (Loshchilov & Hutter, 2017b)
base learning rate: 1.5e-4
learning rate schedule: cosine decay (Loshchilov & Hutter, 2017a)
batch size: 4096
weight decay: 0.05
optimizer momentum: β1, β2 = 0.9, 0.95 (Chen et al., 2020a)
warmup epochs (Goyal et al., 2017a): 20, 40
total epochs: 400, 800
Table 11: Finetuning Hyperparameters
optimizer: AdamW
base learning rate: 1e-3
learning rate schedule: cosine decay
batch size: 1024
weight decay: 0.05
optimizer momentum: β1, β2 = 0.9, 0.999
warmup epochs: 5
total epochs: 100 (B), 50 (L)
augmentation: RandAug (9, 0.5) (Cubuk et al., 2019)
label smoothing (Szegedy et al., 2016): 0.1
mixup (Zhang et al., 2018): 0.8
cutmix (Yun et al., 2019): 1.0
drop path (Huang et al., 2016): 0.1
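The pretraining optimizer settings in Table 10 can be instantiated in PyTorch roughly as below. This is a minimal sketch: the `Linear` placeholder stands in for the ViT backbone, and warmup and the batch-size learning-rate scaling used by MAE are omitted for brevity.

```python
import torch

# Placeholder module standing in for the ViT backbone (an assumption,
# not the paper's model definition).
model = torch.nn.Linear(768, 768)

# AdamW with the Table 10 values: base lr 1.5e-4, betas (0.9, 0.95),
# weight decay 0.05.
optimizer = torch.optim.AdamW(
    model.parameters(), lr=1.5e-4, betas=(0.9, 0.95), weight_decay=0.05
)

# Cosine decay over 800 total epochs (stepped once per epoch);
# warmup epochs are omitted in this sketch.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=800)
```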