MIM-Refiner: A Contrastive Learning Boost from Intermediate Pre-Trained Masked Image Modeling Representations
Authors: Benedikt Alkin, Lukas Miklautz, Sepp Hochreiter, Johannes Brandstetter
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimentally, we show that within a few epochs, MIM-Refiner refines the features of a MIM model to (i) incorporate the beneficial properties of ID objectives, (ii) preserve the advantages of the MIM model, and (iii) exploit the synergies of both methods to improve upon each individual pretraining objective, advancing the state-of-the-art across various benchmarks, see Figure 2. Extensive evaluations show the potential of MIM-Refiner for training large-scale vision foundation models. Our contributions can be summarized as follows: ... 3. We experimentally show the effectiveness and generality of MIM-Refiner by refining a multitude of MIM models of various scales, which achieve new state-of-the-art results in a broad range of downstream tasks. |
| Researcher Affiliation | Collaboration | Benedikt Alkin1,2,@ Lukas Miklautz4 Sepp Hochreiter1,2,3 Johannes Brandstetter1,2 1ELLIS Unit Linz, Institute for Machine Learning, JKU Linz, Austria 2Emmi AI GmbH, Linz, Austria 3NXAI GmbH, Linz, Austria 4Faculty of Computer Science, University of Vienna, Vienna, Austria |
| Pseudocode | No | The paper formally describes the Nearest Neighbor Alignment (NNA) objective using mathematical equations (e.g., Equation 1 and 2) and visual diagrams (Figure 4), but it does not present any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | The full codebase used for all experiments in this paper, together with the exact hyperparameter configurations that were used for each experiment and pre-trained models, can be found in our GitHub repository: https://github.com/ml-jku/MIM-Refiner. |
| Open Datasets | Yes | We refine a series of MIM models, namely MAE (He et al., 2022), data2vec 2.0 (Baevski et al., 2023) (abbreviated as D2V2), dBOT (Liu et al., 2022) and CrossMAE (Fu et al., 2024). These models were pre-trained on ImageNet-1K (Deng et al., 2009), which we also use for the refinement. We fine-tune MIM models and their refined versions on COCO (Lin et al., 2014) using the Mask R-CNN (He et al., 2017) configuration of the ViTDet (Li et al., 2022) framework from detectron2. We show the individual accuracies for each VTAB (Zhai et al., 2019) dataset from Table 4 of the main paper. |
| Dataset Splits | Yes | For the 1, 2 and 5-shot benchmarks we train a logistic regression (Caron et al., 2021; Assran et al., 2022) using the [CLS] token after the last encoder block with the cyanure (Mairal, 2019) library. ... We report the average over three dataset splits from MSN (Assran et al., 2022). For fine-tuning models on VTAB-1K we provide the hyperparameters in Table 27. We search for the best learning rate for each dataset by fine-tuning the model 25 times (5 learning rates with 5 seeds each) on the 800 training samples and evaluating them on the 200 validation samples. With the best learning rate, we then train each model 5 times on the concatenation of the training and validation splits, evaluate on the test split, and report the average accuracy. |
| Hardware Specification | Yes | All models are pre-trained on multiple nodes of 4x A100-64GB GPUs, where ViT-L uses 4 nodes (i.e. 16 GPUs), ViT-H uses 8 nodes (i.e. 32 GPUs) and ViT-2B uses 16 nodes (i.e. 64 GPUs). For evaluations, we use a mix of 4x A100-64GB nodes, 8x A100-40GB nodes and various smaller nodes that vary in number of GPUs. |
| Software Dependencies | Yes | Benchmarks are conducted in PyTorch 2.1 with CUDA 12.1. |
| Experiment Setup | Yes | Hyperparameters for the refinement stage are listed in Table 24. Following MAE-CT (Lehner et al., 2024), we initialize all ID heads first by training them with a frozen encoder to ensure a good learning signal from the start of the refinement process. For this initialization, we use the same hyperparameters as in Table 24 except that we use 20 epochs for all models, a learning rate of 2e-4 and a top1-NN lookup. As we do not use a momentum encoder during training, we instead track an EMA of the encoder and then use the EMA for downstream tasks. As ViT-2B is very expensive to train, we freeze the first 6 blocks (for refinement and also for evaluation). As shown in Table 8, this slightly reduces performance but also reduces memory consumption and runtime. |
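The Nearest Neighbor Alignment objective is described in the paper only via equations and diagrams (no pseudocode). As a rough illustration of the idea, the sketch below implements a generic nearest-neighbor contrastive loss: the top-1 nearest neighbor of the target view is retrieved from a queue of past embeddings and treated as the positive, with the remaining queue entries as negatives. All names, shapes, and the temperature value are illustrative assumptions, not the authors' exact formulation.

```python
import torch
import torch.nn.functional as F

def nn_contrastive_loss(z_anchor, z_target, queue, temperature=0.2):
    """Sketch of a nearest-neighbor contrastive objective (not the
    paper's exact NNA loss).

    z_anchor: (B, D) embeddings of the anchor view
    z_target: (B, D) embeddings of the target view
    queue:    (Q, D) queue of past embeddings
    """
    z_anchor = F.normalize(z_anchor, dim=-1)
    z_target = F.normalize(z_target, dim=-1)
    queue = F.normalize(queue, dim=-1)

    # top-1 NN lookup: find the queue entry closest to each target
    sim_to_queue = z_target @ queue.T          # (B, Q)
    nn_idx = sim_to_queue.argmax(dim=-1)       # (B,)

    # InfoNCE over the queue: the retrieved NN is the positive class,
    # all other queue entries act as negatives
    logits = z_anchor @ queue.T / temperature  # (B, Q)
    return F.cross_entropy(logits, nn_idx)
```

The queue would typically be updated FIFO with detached target embeddings each step; that bookkeeping is omitted here.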
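The few-shot benchmarks train a logistic regression on frozen [CLS] features with the cyanure library. A minimal sketch of such a probe is shown below, substituting scikit-learn's `LogisticRegression` purely for illustration; feature extraction, the specific solver, and regularization settings are assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def few_shot_probe(train_feats, train_labels, test_feats, test_labels):
    """Fit a linear probe on frozen [CLS] features and report test
    accuracy. The paper uses cyanure; sklearn stands in here."""
    clf = LogisticRegression(max_iter=1000)
    clf.fit(train_feats, train_labels)
    return clf.score(test_feats, test_labels)
```

In the paper's protocol this would be repeated over the three MSN dataset splits and the accuracies averaged.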
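Two implementation details from the setup row lend themselves to short sketches: tracking an EMA of the encoder (used for downstream tasks instead of a momentum encoder) and freezing the first 6 blocks of ViT-2B. Both snippets below are generic PyTorch patterns, assuming the actual code may differ in decay schedule and buffer handling.

```python
import torch

@torch.no_grad()
def ema_update(ema_model, model, decay=0.999):
    """One EMA step over parameters: ema = decay*ema + (1-decay)*model.
    Decay value and buffer treatment are illustrative assumptions."""
    for p_ema, p in zip(ema_model.parameters(), model.parameters()):
        p_ema.mul_(decay).add_(p, alpha=1 - decay)

def freeze_first_blocks(blocks, n=6):
    """Freeze the first n transformer blocks to cut memory and
    runtime, as done for ViT-2B during refinement and evaluation."""
    for block in blocks[:n]:
        for p in block.parameters():
            p.requires_grad = False
```

`ema_update` would be called after each optimizer step, and the EMA weights used for evaluation.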