Cluster and Predict Latent Patches for Improved Masked Image Modeling

Authors: Timothée Darcet, Federico Baldassarre, Maxime Oquab, Julien Mairal, Piotr Bojanowski

TMLR 2025

Reproducibility Variable | Result | LLM Response

Research Type | Experimental | In this section, we report empirical evaluations of our model. We describe experimental details and present some ablation studies. Then we discuss whole-image understanding results and dense prediction results.

Researcher Affiliation | Collaboration | Timothée Darcet EMAIL Meta FAIR and Univ. Grenoble Alpes, Inria, CNRS, Grenoble INP, LJK; Federico Baldassarre EMAIL Meta FAIR; Maxime Oquab EMAIL Meta FAIR; Julien Mairal EMAIL Univ. Grenoble Alpes, Inria, CNRS, Grenoble INP, LJK; Piotr Bojanowski EMAIL Meta FAIR

Pseudocode | Yes | We provide the pseudo-code for the standard Sinkhorn-Knopp and for our modified version in fig. 12.

Open Source Code | Yes | We release all our code and models.

Open Datasets | Yes | Pretraining dataset. Most methods from the self-supervised learning literature choose to pretrain on ImageNet-1k. This dataset is usually chosen because of its relatively small size, allowing for easy experimentation, and the ability to compare to existing methods pretrained on it. In contrast, we carry out all ablation experiments on ImageNet-22k. It is composed of 14M images from 22k categories taken from the WordNet ontology. Although it is close to ImageNet-1k in nature, its much larger size and diversity make it suitable to train excellent foundation models, as reported by Oquab et al. (2024). For our longer experiments, we train on multiple datasets: ImageNet-1k, for comparability with previous works; ImageNet-22k, to test scaling; Places205, to test training on more diverse and less object-centric data; and finally LVD-142M, a large-scale automatically curated dataset used in previous SSL foundation models. We refer the reader to Oquab et al. (2024) for more details on the curation process. We use ADE-20k (Zhou et al., 2017), Pascal VOC 2012 (Everingham et al., 2010), and Cityscapes (Cordts et al., 2016).

Dataset Splits | Yes | For all classification tasks, we use an attentive probe (Assran et al., 2023; El-Nouby et al., 2024; Bardes et al., 2023). We choose the optimal regularization parameters by doing a grid search using 10% of the training set. We compute the features for the train and test set considered at resolution 224, and hold out 10% of the train set as a validation set.

Hardware Specification | Yes | For the ViT-L, this batch size fits in 4 nodes of 8 A100 80GB GPUs. We measure the training of a CAPI ViT-L model to take 180h on 32 A100 GPUs, amounting to 5760 A100 hours.

Software Dependencies | No | The training dataset is lightly augmented using a torchvision RandomResizedCrop (maintainers & contributors, 2016) with default hyperparameters and a random horizontal flip. The linear classifier is trained with logistic regression using L-BFGS (Byrd et al., 1995), regularized with an L2 penalty, as implemented in the cuML library (Raschka et al., 2020).

Experiment Setup | Yes | Table 7: CAPI pretraining recipe
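The paper's fig. 12 gives pseudo-code for the standard Sinkhorn-Knopp algorithm and the authors' modified variant. As background, the standard version alternates row and column normalizations of an exponentiated score matrix to produce a balanced soft assignment of patches to prototypes. The following is a minimal NumPy sketch of the standard algorithm only (not the paper's modified version); the function name, default temperature, and iteration count are illustrative, not taken from the paper:

```python
import numpy as np

def sinkhorn_knopp(scores, n_iters=3, eps=1.0):
    """Standard Sinkhorn-Knopp: turn an (n_patches x n_prototypes)
    score matrix into a balanced soft-assignment matrix.
    `eps` is the entropy-regularization temperature."""
    Q = np.exp(scores / eps)   # positive matrix
    Q /= Q.sum()               # global normalization
    n, k = Q.shape
    for _ in range(n_iters):
        Q /= Q.sum(axis=1, keepdims=True)  # rows sum to 1
        Q /= n                             # ...then to 1/n
        Q /= Q.sum(axis=0, keepdims=True)  # columns sum to 1
        Q /= k                             # ...then to 1/k
    # scale so each row is (approximately) a distribution over prototypes
    return Q * n
```

After a few iterations, every prototype receives roughly the same total assignment mass (the columns are exactly balanced after the final column step), which is what prevents the clustering from collapsing onto a few prototypes.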
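The evaluation protocol quoted above (hold out 10% of the train set for validation, grid-search the regularization parameter, refit with the winner) is straightforward to sketch. The paper uses cuML's L-BFGS-based logistic regression; the sketch below substitutes plain gradient descent so it stays dependency-free, and restricts itself to the binary case. The function names, learning rate, step count, and grid values are all illustrative assumptions, not taken from the paper:

```python
import numpy as np

def train_logreg(X, y, l2, lr=0.1, steps=500):
    """Binary logistic regression with an L2 penalty. Plain gradient
    descent stands in for the L-BFGS solver used in the paper."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))       # sigmoid predictions
        grad = X.T @ (p - y) / len(y) + l2 * w   # log-loss + L2 gradient
        w -= lr * grad
    return w

def grid_search_probe(X, y, l2_grid=(1e-4, 1e-2, 1.0), val_frac=0.1, seed=0):
    """Hold out `val_frac` of the training set as a validation split and
    return the L2 penalty achieving the best validation accuracy."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    n_val = max(1, int(val_frac * len(y)))
    val, tr = idx[:n_val], idx[n_val:]
    best = None
    for l2 in l2_grid:
        w = train_logreg(X[tr], y[tr], l2)
        acc = np.mean((X[val] @ w > 0) == y[val])
        if best is None or acc > best[0]:
            best = (acc, l2, w)
    return best  # (val_accuracy, best_l2, weights)
```

In the paper's setting the inputs `X` would be frozen features extracted at resolution 224, and the selected penalty would then be used for the final probe on the full training set.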