Extreme Masking for Learning Instance and Distributed Visual Representations
Authors: Zhirong Wu, Zihang Lai, Xiao Sun, Stephen Lin
TMLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In the experiments, we systematically study the model behavior under different masking ratios, its convergence properties using multiple masks on larger datasets, and integration with various other data augmentations. Based on the study observations, we also propose a new augmentation scheme which uses shared image crops but different colors for the two input views. Our main results on ImageNet-1k outperform prior masked modeling approaches on both finetuning and linear probing metrics. |
| Researcher Affiliation | Collaboration | Zhirong Wu (Microsoft Research Asia), Zihang Lai (Carnegie Mellon University), Xiao Sun (Microsoft Research Asia), Stephen Lin (Microsoft Research Asia) |
| Pseudocode | No | The paper describes the ExtreMA approach and its components like Extreme Masking, Distributed and Instance Representations, and Learning Objective in prose. No structured pseudocode or algorithm blocks are provided. |
| Open Source Code | No | The paper does not explicitly state that source code for the methodology is provided or publicly available, nor does it include any links to a code repository. |
| Open Datasets | Yes | Our main results on ImageNet-1k outperform prior masked modeling approaches on both finetuning and linear probing metrics... We therefore study multi-masking on ImageNet-22k... We evaluate semantic segmentation performance on the ADE20K (Zhou et al., 2017) dataset... We evaluate the transfer performance on the MSCOCO dataset. |
| Dataset Splits | Yes | We pretrain the representation on ImageNet and evaluate it on finetuning (ft) and linear probe (lin) in our ablations. We finetune the model on top of the distributed representation, and conduct linear probes with the instance representation. The evaluation protocol mainly follows BEiT and MAE... Given the pretrained model, we use a small fraction of the ImageNet-1k training labels (1% or 10%) for semi-supervised finetuning. |
| Hardware Specification | Yes | Notably, this is achieved by training ExtreMA using a single node of 8 V100 GPUs in about two days for a ViT-Base model. |
| Software Dependencies | No | The paper mentions using the AdamW optimizer, but does not provide specific version numbers for any software libraries, frameworks (like PyTorch or TensorFlow), or programming languages. |
| Experiment Setup | Yes | We use the original ViT-Base (Dosovitskiy et al., 2021) as the backbone architecture without the layer scale technique (Touvron et al., 2021b). The class attention follows the original design in (Touvron et al., 2021b) with a default of two transformer blocks and a layer scale hyper-parameter of 0.1. We train our model using the AdamW optimizer (Loshchilov & Hutter, 2018) with a batch size of 2048, an initial base learning rate of 1.5e-4, and a weight decay of 0.1. The exponential averaging weight for the momentum encoder is initialized to 0.996 and increased to 1.0 following a cosine schedule. The default augmentation is random resized cropping and random flipping. All models are trained for 300 epochs. Further details are provided in Table 13 and Table 14. |
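The momentum-encoder schedule quoted in the Experiment Setup row (EMA weight initialized to 0.996 and increased to 1.0 following a cosine schedule) can be sketched as below. This is a minimal illustration, not the authors' released code; the function names and the per-step granularity of the schedule are assumptions.

```python
import math

def momentum_schedule(step: int, total_steps: int, base: float = 0.996) -> float:
    """Cosine schedule for the momentum-encoder EMA weight.

    Starts at `base` (0.996 per the paper) at step 0 and rises to 1.0
    at `total_steps`. The step granularity is an assumption here.
    """
    progress = step / total_steps
    return 1.0 - (1.0 - base) * 0.5 * (1.0 + math.cos(math.pi * progress))

def ema_update(teacher: list, student: list, m: float) -> list:
    """One EMA update of momentum-encoder parameters (plain floats for clarity)."""
    return [m * t + (1.0 - m) * s for t, s in zip(teacher, student)]
```

For example, halfway through training the weight is 0.998, and the teacher parameters move only a small fraction of the way toward the student at each step.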