Simplifying DINO via Coding Rate Regularization
Authors: Ziyang Wu, Jingyuan Zhang, Druv Pai, Xudong Wang, Chandan Singh, Jianwei Yang, Jianfeng Gao, Yi Ma
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Overall, our experiments show that our proposed SimDINO model families can achieve better performance and learn representations of higher quality than the original DINO families while being significantly simpler and more robust to variations in hyperparameters and architecture. In this section, we empirically investigate and evaluate our proposed SimDINO and SimDINOv2 models and compare them to the original DINO and DINOv2 model families. In particular, we examine their differences in training dynamics and learned representations both quantitatively and qualitatively. We provide quantitative results in Tables 1, 2, and 3 and qualitative results in Figures 3 and 5. |
| Researcher Affiliation | Collaboration | 1UC Berkeley, 2TranscEngram, 3Microsoft Research, 4HKU. Correspondence to: Ziyang Wu <EMAIL>. |
| Pseudocode | Yes | The overall pipeline is shown in Figure 1(b). Note that it is much simpler than DINO. We provide pseudocode for the training pipelines in Algorithm 1 and Algorithm 2 in Appendix D. |
| Open Source Code | Yes | Code and model checkpoints are available at https://github.com/RobinWu218/SimDINO. |
| Open Datasets | Yes | For pretraining, we use the ImageNet-1K dataset across all methods. Specifically, we evaluate our pretrained models on 1) unsupervised object detection and segmentation on COCO val2017 (Lin et al., 2014), 2) semantic segmentation on ADE20K (Zhou et al., 2017), and 3) video object segmentation on DAVIS-2017 (Pont-Tuset et al., 2017). |
| Dataset Splits | Yes | We report the classification accuracies on ImageNet-1K in Table 1. Following (Caron et al., 2021), we evaluate both k-NN and linear accuracy on the ViT backbones pretrained by the DINO model families and our simplified variants. To quantitatively evaluate these representations, we perform MaskCut on the COCO val2017 dataset and report our results in Table 2. Specifically, we follow the linear evaluation protocol of (Zhou et al., 2021), where we fix the pretrained backbone and only finetune a linear layer on top of it. We follow the same evaluation protocol as in (Caron et al., 2021) and segment scenes between consecutive video frames with nearest neighbors. |
| Hardware Specification | No | The paper does not explicitly mention specific hardware details such as GPU models, CPU types, or memory specifications used for running the experiments. |
| Software Dependencies | No | The paper mentions using AdamW as an optimizer and the bfloat16 dtype, but it does not provide specific version numbers for any key software libraries, frameworks, or environments (e.g., Python, PyTorch, CUDA versions). |
| Experiment Setup | Yes | Specifically, for all inputs we set the patch size to be 16; we use the small, base, and large models of the ViT (Dosovitskiy, 2020) architecture as the backbone, which is connected to a projector composed of three MLP layers with a hidden size of 2048 and an output dimension of 256. For multicrop augmentation, we use 10 local views of resolution 96×96 and 2 global views of resolution 224×224 for all experiments. We provide more details on hyperparameter choices in Appendix E. Appendix E (Table 4) provides extensive details including: patch size 16, initial EMA momentum 0.996, global crops scale 0.4–1.0, 10 local crops, batch size 128×8, 100 epochs, learning rate 0.004, weight decay 0.04, gradient clip 3.0. |
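The paper's central simplification, per its title, is replacing DINO's post-processing heuristics with a coding-rate regularizer. The details of how the paper integrates this term are not quoted above, so the following is only a minimal sketch of the standard coding-rate quantity R(Z) = ½ logdet(I + d/(nε²) ZᵀZ) for a batch of n d-dimensional features, which is the usual form of such a regularizer; the function name, the ε default, and the row-normalization step are our own illustrative assumptions.

```python
import numpy as np

def coding_rate(Z: np.ndarray, eps: float = 0.5) -> float:
    """Coding rate R(Z) = 1/2 * logdet(I_d + d/(n*eps^2) * Z^T Z)
    for an (n, d) feature batch Z, with distortion level eps.
    Maximizing it pushes features to spread out, discouraging collapse."""
    n, d = Z.shape
    gram = (d / (n * eps ** 2)) * (Z.T @ Z)        # scaled d x d covariance
    _, logdet = np.linalg.slogdet(np.eye(d) + gram)  # numerically stable logdet
    return 0.5 * logdet

# Spread-out (row-normalized) features carry a high coding rate;
# collapsed (rank-1) features carry a much lower one.
rng = np.random.default_rng(0)
feats = rng.standard_normal((64, 16))
feats /= np.linalg.norm(feats, axis=1, keepdims=True)  # l2-normalize rows
spread_rate = coding_rate(feats)
collapsed_rate = coding_rate(np.full((64, 16), 0.25))  # all rows identical
```

Working with the d×d Gram matrix ZᵀZ rather than the n×n matrix ZZᵀ gives the same log-determinant (by Sylvester's identity) at lower cost when the batch size exceeds the feature dimension.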
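For quick cross-checking, the hyperparameters quoted from Appendix E (Table 4) can be gathered into one config mapping. The values below are taken directly from the quote; the key names and nesting are our own illustrative choices, not the paper's code.

```python
# Pretraining hyperparameters as quoted from Appendix E (Table 4).
# Key names are illustrative; values follow the reported settings.
simdino_pretrain_config = {
    "patch_size": 16,
    "init_ema_momentum": 0.996,
    "global_crops_scale": (0.4, 1.0),
    "local_crops_number": 10,
    "local_crop_resolution": 96,      # local views are 96x96
    "global_crop_resolution": 224,    # global views are 224x224
    "batch_size": 128 * 8,            # reported as 128x8
    "epochs": 100,
    "learning_rate": 0.004,
    "weight_decay": 0.04,
    "gradient_clip": 3.0,
    "optimizer": "AdamW",             # version/library not specified in the paper
    "dtype": "bfloat16",
    "projector": {"num_layers": 3, "hidden_dim": 2048, "out_dim": 256},
}
```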