Cluster and Predict Latent Patches for Improved Masked Image Modeling

Authors: Timothée Darcet, Federico Baldassarre, Maxime Oquab, Julien Mairal, Piotr Bojanowski

TMLR 2025

Reproducibility Variable | Result | LLM Response

Research Type | Experimental | In this section, we report empirical evaluations of our model. We describe experimental details and present some ablation studies. Then we discuss whole-image understanding results and dense prediction results.

Researcher Affiliation | Collaboration | Timothée Darcet EMAIL Meta FAIR and Univ. Grenoble Alpes, Inria, CNRS, Grenoble INP, LJK; Federico Baldassarre EMAIL Meta FAIR; Maxime Oquab EMAIL Meta FAIR; Julien Mairal EMAIL Univ. Grenoble Alpes, Inria, CNRS, Grenoble INP, LJK; Piotr Bojanowski EMAIL Meta FAIR

Pseudocode | Yes | We provide the pseudo-code for the standard Sinkhorn-Knopp and for our modified version in fig. 12.

Open Source Code | Yes | We release all our code and models.

Open Datasets | Yes | Pretraining dataset. Most methods from the self-supervised learning literature choose to pretrain on ImageNet-1k. This dataset is usually chosen because of its relatively small size, allowing for easy experimentation, and the ability to compare to existing methods pretrained on it. In contrast, we carry out all ablation experiments on ImageNet-22k. It is composed of 14M images from 22k categories taken from the WordNet ontology. Although it is close to ImageNet-1k in nature, its much larger size and diversity make it suitable to train excellent foundation models, as reported by Oquab et al. (2024). For our longer experiments, we train on multiple datasets: ImageNet-1k, for comparability with previous works; ImageNet-22k, to test scaling; Places205, to test training on more diverse and less object-centric data; and finally LVD-142M, a large-scale automatically curated dataset used in previous SSL foundation models. We refer the reader to Oquab et al. (2024) for more details on the curation process. We use ADE-20k (Zhou et al., 2017), Pascal VOC 2012 (Everingham et al., 2010), and Cityscapes (Cordts et al., 2016).

Dataset Splits | Yes | For all classification tasks, we use an attentive probe (Assran et al., 2023; El-Nouby et al., 2024; Bardes et al., 2023). We choose the optimal regularization parameters by doing a grid search using 10% of the training set. We compute the features for the train and test set considered at resolution 224, and hold out 10% of the train set as a validation set.

Hardware Specification | Yes | For the ViT-L, this batch size fits in 4 nodes of 8 A100 80GB GPUs. We measure the training of a CAPI ViT-L model to take 180h on 32 A100 GPUs, amounting to 5760 A100 hours.

Software Dependencies | No | The training dataset is lightly augmented using a torchvision RandomResizedCrop (maintainers & contributors, 2016) with default hyperparameters and a random horizontal flip. The linear classifier is trained with logistic regression using L-BFGS (Byrd et al., 1995), regularized with an L2 penalty, as implemented in the cuML library (Raschka et al., 2020).

Experiment Setup | Yes | Table 7: CAPI pretraining recipe
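The paper's fig. 12 gives pseudo-code for the standard Sinkhorn-Knopp algorithm and the authors' modified variant. As background, the standard version alternates row and column normalizations of an exponentiated score matrix to produce a balanced soft assignment of patches to prototypes. The following is a minimal NumPy sketch of the standard algorithm only (not the paper's modified version); the function name, default temperature, and iteration count are illustrative, not taken from the paper:

```python
import numpy as np

def sinkhorn_knopp(scores, n_iters=3, eps=1.0):
    """Standard Sinkhorn-Knopp: turn an (n_patches x n_prototypes)
    score matrix into a balanced soft-assignment matrix.
    `eps` is the entropy-regularization temperature."""
    Q = np.exp(scores / eps)   # positive matrix
    Q /= Q.sum()               # global normalization
    n, k = Q.shape
    for _ in range(n_iters):
        Q /= Q.sum(axis=1, keepdims=True)  # rows sum to 1
        Q /= n                             # ...then to 1/n
        Q /= Q.sum(axis=0, keepdims=True)  # columns sum to 1
        Q /= k                             # ...then to 1/k
    # scale so each row is (approximately) a distribution over prototypes
    return Q * n
```

After a few iterations, every prototype receives roughly the same total assignment mass (the columns are exactly balanced after the final column step), which is what prevents the clustering from collapsing onto a few prototypes.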
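The evaluation protocol quoted above (hold out 10% of the train set for validation, grid-search the regularization parameter, refit with the winner) is straightforward to sketch. The paper uses cuML's L-BFGS-based logistic regression; the sketch below substitutes plain gradient descent so it stays dependency-free, and restricts itself to the binary case. The function names, learning rate, step count, and grid values are all illustrative assumptions, not taken from the paper:

```python
import numpy as np

def train_logreg(X, y, l2, lr=0.1, steps=500):
    """Binary logistic regression with an L2 penalty. Plain gradient
    descent stands in for the L-BFGS solver used in the paper."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))       # sigmoid predictions
        grad = X.T @ (p - y) / len(y) + l2 * w   # log-loss + L2 gradient
        w -= lr * grad
    return w

def grid_search_probe(X, y, l2_grid=(1e-4, 1e-2, 1.0), val_frac=0.1, seed=0):
    """Hold out `val_frac` of the training set as a validation split and
    return the L2 penalty achieving the best validation accuracy."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    n_val = max(1, int(val_frac * len(y)))
    val, tr = idx[:n_val], idx[n_val:]
    best = None
    for l2 in l2_grid:
        w = train_logreg(X[tr], y[tr], l2)
        acc = np.mean((X[val] @ w > 0) == y[val])
        if best is None or acc > best[0]:
            best = (acc, l2, w)
    return best  # (val_accuracy, best_l2, weights)
```

In the paper's setting the inputs `X` would be frozen features extracted at resolution 224, and the selected penalty would then be used for the final probe on the full training set.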