Masked Image Residual Learning for Scaling Deeper Vision Transformers

Authors: Guoxi Huang, Hongtao Fu, Adrian G. Bors

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The proposed MIRL method is evaluated on image classification, object detection and semantic segmentation tasks. All models are pre-trained on ImageNet-1K and then fine-tuned in downstream tasks. ... Table 2: MIRL ablation experiments on ImageNet-1K
Researcher Affiliation | Collaboration | Guoxi Huang, Baidu Inc.; Hongtao Fu, Huazhong University of Science and Technology; Adrian G. Bors, University of York
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | Yes | Code and pretrained models are available at: https://github.com/russellllaputa/MIRL.
Open Datasets | Yes | We pre-train all models on the training set of ImageNet-1K with 32 GPUs. ... The experiment is conducted on MS COCO [30]... We compare our method with previous results on the ADE20K [61] dataset
Dataset Splits | Yes | All models are pre-trained on ImageNet-1K and then fine-tuned in downstream tasks. ... Table 2: MIRL ablation experiments on ImageNet-1K: We report the fine-tuning (ft) accuracy (%) for all models, which are pre-trained for 300 epochs.
Hardware Specification | No | We pre-train all models on the training set of ImageNet-1K with 32 GPUs.
Software Dependencies | No | The paper mentions frameworks and libraries such as the Transformer architecture, MAE, Mask R-CNN, and mmdetection, but does not provide specific version numbers for any software dependencies.
Experiment Setup | Yes | Pre-training setup. We pre-train all models on the training set of ImageNet-1K with 32 GPUs. By default, ViT-B-24 is divided into 4 segments, while ViT-S-54 and ViT-B-48 are split into 6 segments, and others into 2. Each appended decoder has 2 Transformer blocks with an injected DID module. We follow the setup in [21], masking 75% of visual tokens and applying basic data augmentation, including random horizontal flipping and random resized cropping. Full implementation details are in Appendix A.
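The 75% token masking quoted above follows the MAE setup in [21]. As a minimal sketch (function and variable names are illustrative, not taken from the MIRL codebase), per-sample random masking can be implemented by shuffling token indices and keeping the first 25%:

```python
import torch

def random_mask(tokens: torch.Tensor, mask_ratio: float = 0.75):
    """MAE-style random masking: keep a random (1 - mask_ratio) fraction of tokens.

    tokens: (B, N, D) batch of patch-embedding sequences.
    Returns the visible tokens, the kept indices, and a binary mask
    (0 = visible, 1 = masked) usable for reconstruction targets.
    """
    B, N, D = tokens.shape
    n_keep = int(N * (1 - mask_ratio))

    noise = torch.rand(B, N)                # independent random score per token
    ids_shuffle = noise.argsort(dim=1)      # random permutation per sample
    ids_keep = ids_shuffle[:, :n_keep]      # first n_keep indices stay visible

    visible = torch.gather(
        tokens, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D)
    )

    mask = torch.ones(B, N)
    mask.scatter_(1, ids_keep, 0.0)         # zero out the visible positions
    return visible, ids_keep, mask
```

With 14x14 = 196 patch tokens and the default 75% ratio, only 49 visible tokens enter the encoder, which is what makes this style of pre-training cheap relative to processing the full sequence.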