Masked Image Residual Learning for Scaling Deeper Vision Transformers

Authors: Guoxi Huang, Hongtao Fu, Adrian G. Bors

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The proposed MIRL method is evaluated on image classification, object detection and semantic segmentation tasks. All models are pre-trained on ImageNet-1K and then fine-tuned in downstream tasks. ... Table 2: MIRL ablation experiments on ImageNet-1K
Researcher Affiliation | Collaboration | Guoxi Huang, Baidu Inc.; Hongtao Fu, Huazhong University of Science and Technology; Adrian G. Bors, University of York
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | Yes | Code and pretrained models are available at: https://github.com/russellllaputa/MIRL.
Open Datasets | Yes | We pre-train all models on the training set of ImageNet-1K with 32 GPUs. ... The experiment is conducted on MS COCO [30]... We compare our method with previous results on the ADE20K [61] dataset
Dataset Splits | Yes | All models are pre-trained on ImageNet-1K and then fine-tuned in downstream tasks. ... Table 2: MIRL ablation experiments on ImageNet-1K: We report the fine-tuning (ft) accuracy (%) for all models, which are pre-trained for 300 epochs.
Hardware Specification | No | We pre-train all models on the training set of ImageNet-1K with 32 GPUs.
Software Dependencies | No | The paper mentions frameworks and libraries such as the Transformer architecture, MAE, Mask R-CNN, and mmdetection, but does not provide specific version numbers for any software dependencies.
Experiment Setup | Yes | Pre-training setup. We pre-train all models on the training set of ImageNet-1K with 32 GPUs. By default, ViT-B-24 is divided into 4 segments, while ViT-S-54 and ViT-B-48 are split into 6 segments, and others into 2. Each appended decoder has 2 Transformer blocks with an injected DID module. We follow the setup in [21], masking 75% of visual tokens and applying basic data augmentation, including random horizontal flipping and random resized cropping. Full implementation details are in Appendix A.
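The 75% token masking quoted above follows the MAE setup in [21]. As a minimal sketch (function and variable names are illustrative, not taken from the MIRL codebase), per-sample random masking can be implemented by shuffling token indices and keeping the first 25%:

```python
import torch

def random_mask(tokens: torch.Tensor, mask_ratio: float = 0.75):
    """MAE-style random masking: keep a random (1 - mask_ratio) fraction of tokens.

    tokens: (B, N, D) batch of patch-embedding sequences.
    Returns the visible tokens, the kept indices, and a binary mask
    (0 = visible, 1 = masked) usable for reconstruction targets.
    """
    B, N, D = tokens.shape
    n_keep = int(N * (1 - mask_ratio))

    noise = torch.rand(B, N)                # independent random score per token
    ids_shuffle = noise.argsort(dim=1)      # random permutation per sample
    ids_keep = ids_shuffle[:, :n_keep]      # first n_keep indices stay visible

    visible = torch.gather(
        tokens, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D)
    )

    mask = torch.ones(B, N)
    mask.scatter_(1, ids_keep, 0.0)         # zero out the visible positions
    return visible, ids_keep, mask
```

With 14x14 = 196 patch tokens and the default 75% ratio, only 49 visible tokens enter the encoder, which is what makes this style of pre-training cheap relative to processing the full sequence.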