Simplifying DINO via Coding Rate Regularization
Authors: Ziyang Wu, Jingyuan Zhang, Druv Pai, Xudong Wang, Chandan Singh, Jianwei Yang, Jianfeng Gao, Yi Ma
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Overall, our experiments show that our proposed SimDINO model families can achieve better performance and learn representations of higher quality than the original DINO families while being significantly simpler and more robust to variations in hyperparameters and architecture. In this section, we empirically investigate and evaluate our proposed SimDINO and SimDINOv2 models and compare them to the original DINO and DINOv2 model families. In particular, we examine their differences in training dynamics and learned representations both quantitatively and qualitatively. We provide quantitative results in Tables 1, 2, and 3 and qualitative results in Figures 3 and 5. |
| Researcher Affiliation | Collaboration | 1UC Berkeley, 2TranscEngram, 3Microsoft Research, 4HKU. Correspondence to: Ziyang Wu <EMAIL>. |
| Pseudocode | Yes | The overall pipeline is shown in Figure 1(b). Note that it is much simpler than DINO. We provide pseudocode for the training pipelines in Algorithm 1 and Algorithm 2 in Appendix D. |
| Open Source Code | Yes | Code and model checkpoints are available at https://github.com/RobinWu218/SimDINO. |
| Open Datasets | Yes | For pretraining, we use the ImageNet-1K dataset across all methods. Specifically, we evaluate our pretrained models on 1) unsupervised object detection and segmentation on COCO val2017 (Lin et al., 2014), 2) semantic segmentation on ADE20K (Zhou et al., 2017), and 3) video object segmentation on DAVIS-2017 (Pont-Tuset et al., 2017). |
| Dataset Splits | Yes | We report the classification accuracies on ImageNet-1K in Table 1. Following (Caron et al., 2021), we evaluate both k-NN and linear accuracy on the ViT backbones pretrained by the DINO model families and our simplified variants. To quantitatively evaluate these representations, we perform MaskCut on the COCO val2017 dataset and report our results in Table 2. Specifically, we follow the linear evaluation protocol of (Zhou et al., 2021), where we fix the pretrained backbone and only finetune a linear layer on top of it. We follow the same evaluation protocol as in (Caron et al., 2021) and segment scenes between consecutive video frames with nearest neighbors. |
| Hardware Specification | No | The paper does not explicitly mention specific hardware details such as GPU models, CPU types, or memory specifications used for running the experiments. |
| Software Dependencies | No | The paper mentions using AdamW as an optimizer and the bfloat16 dtype, but it does not provide specific version numbers for any key software libraries, frameworks, or environments (e.g., Python, PyTorch, CUDA versions). |
| Experiment Setup | Yes | Specifically, for all inputs we set the patch size to be 16; we use the small, base, and large models of the ViT (Dosovitskiy, 2020) architecture as the backbone, which is connected to a projector composed of three MLP layers with a hidden size of 2048 and an output dimension of 256. For multicrop augmentation, we use 10 local views of resolution 96×96 and 2 global views of resolution 224×224 for all experiments. We provide more details on hyperparameter choices in Appendix E. Appendix E (Table 4) provides extensive details including: patch size 16, initial EMA momentum 0.996, global crops scale 0.4–1.0, 10 local crops, batch size 128×8, 100 epochs, learning rate 0.004, weight decay 0.04, gradient clip 3.0. |
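The paper's central simplification, per its title, is replacing DINO's post-processing heuristics with a coding-rate regularizer. The details of how the paper integrates this term are not quoted above, so the following is only a minimal sketch of the standard coding-rate quantity R(Z) = ½ logdet(I + d/(nε²) ZᵀZ) for a batch of n d-dimensional features, which is the usual form of such a regularizer; the function name, the ε default, and the row-normalization step are our own illustrative assumptions.

```python
import numpy as np

def coding_rate(Z: np.ndarray, eps: float = 0.5) -> float:
    """Coding rate R(Z) = 1/2 * logdet(I_d + d/(n*eps^2) * Z^T Z)
    for an (n, d) feature batch Z, with distortion level eps.
    Maximizing it pushes features to spread out, discouraging collapse."""
    n, d = Z.shape
    gram = (d / (n * eps ** 2)) * (Z.T @ Z)        # scaled d x d covariance
    _, logdet = np.linalg.slogdet(np.eye(d) + gram)  # numerically stable logdet
    return 0.5 * logdet

# Spread-out (row-normalized) features carry a high coding rate;
# collapsed (rank-1) features carry a much lower one.
rng = np.random.default_rng(0)
feats = rng.standard_normal((64, 16))
feats /= np.linalg.norm(feats, axis=1, keepdims=True)  # l2-normalize rows
spread_rate = coding_rate(feats)
collapsed_rate = coding_rate(np.full((64, 16), 0.25))  # all rows identical
```

Working with the d×d Gram matrix ZᵀZ rather than the n×n matrix ZZᵀ gives the same log-determinant (by Sylvester's identity) at lower cost when the batch size exceeds the feature dimension.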
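For quick cross-checking, the hyperparameters quoted from Appendix E (Table 4) can be gathered into one config mapping. The values below are taken directly from the quote; the key names and nesting are our own illustrative choices, not the paper's code.

```python
# Pretraining hyperparameters as quoted from Appendix E (Table 4).
# Key names are illustrative; values follow the reported settings.
simdino_pretrain_config = {
    "patch_size": 16,
    "init_ema_momentum": 0.996,
    "global_crops_scale": (0.4, 1.0),
    "local_crops_number": 10,
    "local_crop_resolution": 96,      # local views are 96x96
    "global_crop_resolution": 224,    # global views are 224x224
    "batch_size": 128 * 8,            # reported as 128x8
    "epochs": 100,
    "learning_rate": 0.004,
    "weight_decay": 0.04,
    "gradient_clip": 3.0,
    "optimizer": "AdamW",             # version/library not specified in the paper
    "dtype": "bfloat16",
    "projector": {"num_layers": 3, "hidden_dim": 2048, "out_dim": 256},
}
```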