Learning Mask Invariant Mutual Information for Masked Image Modeling

Authors: Tao Huang, Yanxiang Ma, Shan You, Chang Xu

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on standard benchmarks show that MI-MAE significantly outperforms MAE models in tasks such as image classification, object detection, and semantic segmentation. Our findings validate the theoretical framework and highlight the practical advantages of applying the information bottleneck principle to MAEs, offering deeper insights for developing more powerful self-supervised learning models.
Researcher Affiliation | Collaboration | 1. School of Computer Science, Faculty of Engineering, The University of Sydney; 2. SenseTime Research
Pseudocode | Yes | Algorithm 1: Self-supervised pre-training with MI-MAE. Our changes to MAE are marked with *. Input: Encoder E, decoder D, variational distribution approximation network V with parameters θ, training dataset D_tr, number of masks per image N.
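The quoted Algorithm 1 takes "number of masks per image N" as an input, i.e. each image is masked N times per step. A minimal sketch of MAE-style random patch masking producing N masks is below; the 75% mask ratio is MAE's default, and the value of N is an assumption, as neither is stated in the excerpt.

```python
import random

def random_masks(num_patches, mask_ratio=0.75, n_masks=4, seed=0):
    """Generate N independent random patch masks for one image.

    mask_ratio=0.75 follows MAE's default; n_masks corresponds to the
    'number of masks per image N' in Algorithm 1 (value assumed here).
    Each mask is the set of patch indices that are hidden from the encoder.
    """
    rng = random.Random(seed)
    num_masked = int(num_patches * mask_ratio)
    masks = []
    for _ in range(n_masks):
        idx = list(range(num_patches))
        rng.shuffle(idx)                   # random permutation of patches
        masks.append(set(idx[:num_masked]))  # first num_masked become masked
    return masks
```

For a 14×14 ViT patch grid (196 patches), each call yields N sets of 147 masked patch indices.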
Open Source Code | No | The paper does not provide an explicit statement about the release of source code or a link to a code repository.
Open Datasets | Yes | Image classification: Our method is developed based on the official code of MAE (He et al., 2022). We strictly adhere to the original pre-training and fine-tuning settings on ImageNet-1K (Russakovsky et al., 2015). Object detection: We transfer the pre-trained ViT models to the COCO (Lin et al., 2014) dataset. Semantic segmentation: We conduct semantic segmentation experiments on the ADE20K (Zhou et al., 2017) dataset, using the same settings as in MAE (He et al., 2022).
Dataset Splits | Yes | We strictly adhere to the original pre-training and fine-tuning settings on ImageNet-1K (Russakovsky et al., 2015). We transfer the pre-trained ViT models to the COCO (Lin et al., 2014) dataset. We adopt the Mask R-CNN framework (He et al., 2017), which predicts detections and instance segmentations simultaneously. We follow the model setup and training strategy used in ViTDet (Li et al., 2022b).
Hardware Specification | Yes | All our experiments use NVIDIA V100 GPUs.
Software Dependencies | No | The paper mentions several frameworks and optimizers such as AdamW, Mask R-CNN, and UperNet, but does not specify their version numbers or other software dependencies with version information.
Experiment Setup | Yes | Pre-training: ... We pre-train the models using an AdamW optimizer (Loshchilov & Hutter, 2019) with β1 = 0.9, β2 = 0.95, and a weight decay of 0.05. The total batch size is 1024 ... We use a cosine decay learning rate schedule with a 10-epoch warmup and a base learning rate of 1.5×10⁻⁴. For the hyper-parameters introduced by our MI-MAE, we set λ1 = λ2 = 1 and λ3 = 10.
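The quoted schedule (linear warmup for 10 epochs, then cosine decay from a base learning rate of 1.5e-4) can be sketched per epoch as below. The total epoch count and the final minimum learning rate are assumptions; they are not stated in the excerpt.

```python
import math

def mae_lr(epoch, base_lr=1.5e-4, warmup_epochs=10, total_epochs=400, min_lr=0.0):
    """Learning rate at a given epoch: linear warmup, then cosine decay.

    base_lr and warmup_epochs come from the quoted setup; total_epochs
    and min_lr are assumed, and MAE-style batch-size scaling of the
    learning rate is omitted for simplicity.
    """
    if epoch < warmup_epochs:
        # linear warmup from base_lr / warmup_epochs up to base_lr
        return base_lr * (epoch + 1) / warmup_epochs
    # cosine decay from base_lr down toward min_lr over remaining epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

Epoch 9 (the last warmup epoch) reaches the full base rate of 1.5e-4, after which the rate decays monotonically.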