MIM-Refiner: A Contrastive Learning Boost from Intermediate Pre-Trained Masked Image Modeling Representations
Authors: Benedikt Alkin, Lukas Miklautz, Sepp Hochreiter, Johannes Brandstetter
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimentally, we show that within a few epochs, MIM-Refiner refines the features of a MIM model to (i) incorporate the beneficial properties of ID objectives, (ii) preserve the advantages of the MIM model, and (iii) exploit the synergies of both methods to improve upon each individual pretraining objective, advancing the state-of-the-art across various benchmarks, see Figure 2. Extensive evaluations show the potential of MIM-Refiner for training large-scale vision foundation models. Our contributions can be summarized as follows: ... 3. We experimentally show the effectiveness and generality of MIM-Refiner by refining a multitude of MIM models of various scales, which achieve new state-of-the-art results in a broad range of downstream tasks. |
| Researcher Affiliation | Collaboration | Benedikt Alkin1,2,@ Lukas Miklautz4 Sepp Hochreiter1,2,3 Johannes Brandstetter1,2 1ELLIS Unit Linz, Institute for Machine Learning, JKU Linz, Austria 2Emmi AI GmbH, Linz, Austria 3NXAI GmbH, Linz, Austria 4Faculty of Computer Science, University of Vienna, Vienna, Austria |
| Pseudocode | No | The paper formally describes the Nearest Neighbor Alignment (NNA) objective using mathematical equations (e.g., Equation 1 and 2) and visual diagrams (Figure 4), but it does not present any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | The full codebase used for all experiments in this paper, together with the exact hyperparameter configurations that were used for each experiment and pre-trained models, can be found in our GitHub repository: https://github.com/ml-jku/MIM-Refiner. |
| Open Datasets | Yes | We refine a series of MIM models, namely MAE (He et al., 2022), data2vec 2.0 (Baevski et al., 2023) (abbreviated as D2V2), dBOT (Liu et al., 2022) and CrossMAE (Fu et al., 2024). These models were pre-trained on ImageNet-1K (Deng et al., 2009), which we also use for the refinement. We fine-tune MIM models and their refined versions on COCO (Lin et al., 2014) using the Mask R-CNN (He et al., 2017) configuration of the ViTDet (Li et al., 2022) framework from detectron2. We show the individual accuracies for each VTAB (Zhai et al., 2019) dataset from Table 4 of the main paper. |
| Dataset Splits | Yes | For the 1, 2 and 5-shot benchmarks we train a logistic regression (Caron et al., 2021; Assran et al., 2022) using the [CLS] token after the last encoder block with the cyanure (Mairal, 2019) library. ... We report the average over three dataset splits from MSN (Assran et al., 2022). For fine-tuning models on VTAB-1K we provide the hyperparameters in Table 27. We search for the best learning rate for each dataset by fine-tuning the model 25 times (5 learning rates with 5 seeds each) on the 800 training samples and evaluating them on the 200 validation samples. With the best learning rate, we then train each model 5 times on the concatenation of the training and validation splits, evaluate on the test split, and report the average accuracy. |
| Hardware Specification | Yes | All models are pre-trained on multiple nodes of 4x A100-64GB GPUs, where ViT-L uses 4 nodes (i.e. 16 GPUs), ViT-H uses 8 nodes (i.e. 32 GPUs) and ViT-2B uses 16 nodes (i.e. 64 GPUs). For evaluations, we use a mix of 4x A100-64GB nodes, 8x A100-40GB nodes and various smaller nodes that vary in number of GPUs. |
| Software Dependencies | Yes | Benchmarks are conducted in PyTorch 2.1 with CUDA 12.1. |
| Experiment Setup | Yes | Hyperparameters for the refinement stage are listed in Table 24. Following MAE-CT (Lehner et al., 2024), we initialize all ID heads first by training them with a frozen encoder to ensure a good learning signal from the start of the refinement process. For this initialization, we use the same hyperparameters as in Table 24 except that we use 20 epochs for all models, a learning rate of 2e-4 and a top1-NN lookup. As we do not use a momentum encoder during training, we instead track an EMA of the encoder and then use the EMA for downstream tasks. As ViT-2B is very expensive to train, we freeze the first 6 blocks (for refinement and also for evaluation). As shown in Table 8, this slightly reduces performance but also reduces memory consumption and runtime. |
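The Nearest Neighbor Alignment objective is described in the paper only via equations and diagrams (no pseudocode). As a rough illustration of the idea, the sketch below implements a generic nearest-neighbor contrastive loss: the top-1 nearest neighbor of the target view is retrieved from a queue of past embeddings and treated as the positive, with the remaining queue entries as negatives. All names, shapes, and the temperature value are illustrative assumptions, not the authors' exact formulation.

```python
import torch
import torch.nn.functional as F

def nn_contrastive_loss(z_anchor, z_target, queue, temperature=0.2):
    """Sketch of a nearest-neighbor contrastive objective (not the
    paper's exact NNA loss).

    z_anchor: (B, D) embeddings of the anchor view
    z_target: (B, D) embeddings of the target view
    queue:    (Q, D) queue of past embeddings
    """
    z_anchor = F.normalize(z_anchor, dim=-1)
    z_target = F.normalize(z_target, dim=-1)
    queue = F.normalize(queue, dim=-1)

    # top-1 NN lookup: find the queue entry closest to each target
    sim_to_queue = z_target @ queue.T          # (B, Q)
    nn_idx = sim_to_queue.argmax(dim=-1)       # (B,)

    # InfoNCE over the queue: the retrieved NN is the positive class,
    # all other queue entries act as negatives
    logits = z_anchor @ queue.T / temperature  # (B, Q)
    return F.cross_entropy(logits, nn_idx)
```

The queue would typically be updated FIFO with detached target embeddings each step; that bookkeeping is omitted here.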
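The few-shot benchmarks train a logistic regression on frozen [CLS] features with the cyanure library. A minimal sketch of such a probe is shown below, substituting scikit-learn's `LogisticRegression` purely for illustration; feature extraction, the specific solver, and regularization settings are assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def few_shot_probe(train_feats, train_labels, test_feats, test_labels):
    """Fit a linear probe on frozen [CLS] features and report test
    accuracy. The paper uses cyanure; sklearn stands in here."""
    clf = LogisticRegression(max_iter=1000)
    clf.fit(train_feats, train_labels)
    return clf.score(test_feats, test_labels)
```

In the paper's protocol this would be repeated over the three MSN dataset splits and the accuracies averaged.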
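Two implementation details from the setup row lend themselves to short sketches: tracking an EMA of the encoder (used for downstream tasks instead of a momentum encoder) and freezing the first 6 blocks of ViT-2B. Both snippets below are generic PyTorch patterns, assuming the actual code may differ in decay schedule and buffer handling.

```python
import torch

@torch.no_grad()
def ema_update(ema_model, model, decay=0.999):
    """One EMA step over parameters: ema = decay*ema + (1-decay)*model.
    Decay value and buffer treatment are illustrative assumptions."""
    for p_ema, p in zip(ema_model.parameters(), model.parameters()):
        p_ema.mul_(decay).add_(p, alpha=1 - decay)

def freeze_first_blocks(blocks, n=6):
    """Freeze the first n transformer blocks to cut memory and
    runtime, as done for ViT-2B during refinement and evaluation."""
    for block in blocks[:n]:
        for p in block.parameters():
            p.requires_grad = False
```

`ema_update` would be called after each optimizer step, and the EMA weights used for evaluation.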