Learning Mask Invariant Mutual Information for Masked Image Modeling
Authors: Tao Huang, Yanxiang Ma, Shan You, Chang Xu
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on standard benchmarks show that MI-MAE significantly outperforms MAE models in tasks such as image classification, object detection, and semantic segmentation. Our findings validate the theoretical framework and highlight the practical advantages of applying the information bottleneck principle to MAEs, offering deeper insights for developing more powerful self-supervised learning models. |
| Researcher Affiliation | Collaboration | ¹School of Computer Science, Faculty of Engineering, The University of Sydney; ²SenseTime Research |
| Pseudocode | Yes | Algorithm 1 Self-supervised pre-training with MI-MAE. Our changes to MAE are marked with *. Input: Encoder E, decoder D, variational distribution approximation network V with parameters θ, training dataset Dtr, number of masks per image N. |
| Open Source Code | No | The paper does not provide an explicit statement about the release of source code or a link to a code repository. |
| Open Datasets | Yes | Image classification. Our method is developed based on the official code of MAE (He et al., 2022). We strictly adhere to the original pre-training and fine-tuning settings on ImageNet-1K (Russakovsky et al., 2015). Object detection. We transfer the pre-trained ViT models to the COCO (Lin et al., 2014) dataset. Semantic segmentation. We conduct semantic segmentation experiments on the ADE20K (Zhou et al., 2017) dataset, using the same settings as in MAE (He et al., 2022). |
| Dataset Splits | Yes | We strictly adhere to the original pre-training and fine-tuning settings on ImageNet-1K (Russakovsky et al., 2015). We transfer the pre-trained ViT models to the COCO (Lin et al., 2014) dataset. We adopt the Mask R-CNN framework (He et al., 2017), which predicts detections and instance segmentations simultaneously. We follow the model setup and training strategy used in ViTDet (Li et al., 2022b). |
| Hardware Specification | Yes | All our experiments use NVIDIA V100 GPUs. |
| Software Dependencies | No | The paper mentions several frameworks and optimizers such as AdamW, Mask R-CNN, and UperNet, but does not specify their version numbers or other software dependencies with version information. |
| Experiment Setup | Yes | Pre-training. ... We pre-train the models using an AdamW optimizer (Loshchilov & Hutter, 2019) with β1 = 0.9, β2 = 0.95, and a weight decay of 0.05. The total batch size is 1024 ... We use a cosine decay learning rate schedule with a 10-epoch warmup and a base learning rate of 1.5 × 10⁻⁴. For the hyper-parameters introduced by our MI-MAE, we set λ1 = λ2 = 1 and λ3 = 10. |
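The pseudocode row notes that Algorithm 1 extends MAE by drawing N masks per image. A minimal sketch of that sampling step is below; the 75% mask ratio is MAE's default and an assumption here, since the excerpt only states that N masks are drawn per image.

```python
import random

def sample_masks(num_patches, n_masks, mask_ratio=0.75, seed=None):
    """Return n_masks boolean lists; True marks a masked (hidden) patch.

    Sketch of MI-MAE's per-image multi-mask sampling; mask_ratio=0.75
    follows MAE's default and is not stated in the excerpt above.
    """
    rng = random.Random(seed)
    num_masked = int(num_patches * mask_ratio)
    masks = []
    for _ in range(n_masks):
        # choose a fresh random subset of patches to hide for each mask
        hidden = rng.sample(range(num_patches), num_masked)
        mask = [False] * num_patches
        for i in hidden:
            mask[i] = True
        masks.append(mask)
    return masks
```

For a 14×14 ViT patch grid (196 patches), each of the N masks hides 147 patches, and the encoder sees a different visible subset per mask.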
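The experiment-setup row specifies a cosine decay learning-rate schedule with a 10-epoch warmup and a base learning rate of 1.5 × 10⁻⁴. A hedged sketch of that schedule is below; the 400-epoch total is a typical MAE pre-training length and an assumption, as the excerpt does not state it.

```python
import math

def lr_at_epoch(epoch, base_lr=1.5e-4, warmup_epochs=10, total_epochs=400):
    """Learning rate at a (possibly fractional) epoch.

    Linear warmup to base_lr over the first warmup_epochs, then cosine
    decay to zero; total_epochs=400 is an assumed value.
    """
    if epoch < warmup_epochs:
        # linear warmup from 0 up to base_lr
        return base_lr * epoch / warmup_epochs
    # cosine decay from base_lr down to 0 over the remaining epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

Note that in MAE's convention the base learning rate is further scaled by `batch_size / 256` to obtain the effective rate; with the stated batch size of 1024 that scaling factor would be 4.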