Accessing Vision Foundation Models via ImageNet-1K
Authors: Yitian Zhang, Xu Ma, Yue Bai, Huan Wang, Yun Fu
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | When leveraging DINOv2-g/14 as the teacher, Proteus-L/14 matches the performance of the Oracle method DINOv2-L/14 (142M training data) across 19 benchmarks and outperforms other vision foundation models including CLIP-L/14 (400M), OpenCLIP-L/14 (400M/2B) and SynCLR-L/14 (600M) with a significantly smaller training set of 1.2M images. In this part, we conduct pre-training on ImageNet-1K (Deng et al., 2009) and validate our methods under different setups. Distilling from DINOv2 (Oquab et al., 2023), we first conduct ablations to verify our designs. Then, we evaluate our method on the object recognition task and perform linear probing on ImageNet-1K and 12 fine-grained datasets. Further, we validate our approach on dense prediction tasks, including semantic segmentation and depth estimation. |
| Researcher Affiliation | Academia | 1Department of Electrical and Computer Engineering, Northeastern University 2Khoury College of Computer Science, Northeastern University EMAIL EMAIL |
| Pseudocode | No | The paper describes the methodology using mathematical equations (e.g., L = (1 − λ)·L_CE + λ·L_KL and L = λ_token·L_token + λ_feat·L_feat + λ_patch·L_patch) and textual explanations, but it does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | The paper states that code is available (a link is provided in the paper). |
| Open Datasets | Yes | We conduct pre-training on the training set of ImageNet-1K (Deng et al., 2009), comprising approximately 1.2 million images distributed across 1,000 categories. We conduct experiments on the ADE20K dataset (Zhou et al., 2017) using the single-scale setup. We evaluate the methods on the NYU-Depth V2 (Silberman et al., 2012) dataset. |
| Dataset Splits | Yes | We conduct pre-training on the training set of ImageNet-1K (Deng et al., 2009), comprising approximately 1.2 million images distributed across 1,000 categories. For all ImageNet linear probing results, we follow the linear probing protocol outlined in DINOv2 (Oquab et al., 2023). We conduct experiments on the ADE20K dataset (Zhou et al., 2017) using the single-scale setup. We evaluate the methods on the NYU-Depth V2 (Silberman et al., 2012) dataset. As with semantic segmentation, we utilize a resolution of 512×512 for models trained with a patch size of 16×16, and a resolution of 518×518 for models trained with a patch size of 14×14. Appendices A.1.2, A.1.3, A.1.4, and A.1.5 describe detailed evaluation protocols for ImageNet linear probing, fine-grained classification, semantic segmentation, and depth estimation, which implicitly rely on standard splits or specific methods of data usage for those benchmarks. |
| Hardware Specification | Yes | All models are trained for 300 epochs with a batch size of 1024 on 8 A100 GPUs, except the ViT-L experiment, which uses a batch size of 256 due to GPU memory constraints. |
| Software Dependencies | No | The paper mentions following the training protocols of other works (DeiT, DINOv2), using specific optimizers (SGD, L-BFGS), and task layers (UperNet, DPT), but does not specify version numbers for any programming languages (e.g., Python), deep learning frameworks (e.g., PyTorch, TensorFlow), or other key software libraries. |
| Experiment Setup | Yes | We strictly follow the DeiT (Touvron et al., 2021) training protocol on ImageNet-1K (Deng et al., 2009) except that we do not enable Mixup (Zhang et al., 2017). All models are trained for 300 epochs with a batch size of 1024 on 8 A100 GPUs, except the ViT-L experiment, which uses a batch size of 256 due to GPU memory constraints. λ_token, λ_feat, λ_patch are introduced hyperparameters to balance the three terms, and we simply set them to 1 in our implementations without additional fine-tuning. For all ImageNet linear probing results, we follow the linear probing protocol outlined in DINOv2 (Oquab et al., 2023). Specifically, the linear layer is trained using the SGD optimizer for 25,020 iterations with a batch size of 512. We employ random-resized-crop data augmentation and conduct a grid search over the hyperparameters as defined in DINOv2. The training objective is minimized using L-BFGS with L2-regularization, and the L2-regularization constant is chosen on the validation set from 45 logarithmically spaced values between 10⁻⁶ and 10³. Unlike DINOv2 (Oquab et al., 2023) and SynCLR, we reduce the maximum number of L-BFGS iterations from 1000 to 500 for faster evaluation. |
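The composite distillation objective quoted above, L = λ_token·L_token + λ_feat·L_feat + λ_patch·L_patch with all three weights set to 1, can be sketched in a few lines. This is an illustrative NumPy mock-up, not the authors' code: the `mse_loss` helper, the dictionary keys (`cls`, `feat`, `patch`), and the toy tensor shapes are all assumptions made for the sake of the example, and MSE is used as a stand-in for whichever per-term alignment loss the paper actually employs.

```python
import numpy as np

def mse_loss(a, b):
    """Mean-squared error between two arrays (a common choice for feature distillation)."""
    return float(np.mean((a - b) ** 2))

def proteus_loss(student, teacher, lam_token=1.0, lam_feat=1.0, lam_patch=1.0):
    """Combined objective L = λ_token·L_token + λ_feat·L_feat + λ_patch·L_patch.

    `student` / `teacher` are dicts of activations; the keys and shapes here
    are hypothetical, chosen only to make the sketch runnable.
    """
    l_token = mse_loss(student["cls"], teacher["cls"])      # class-token alignment
    l_feat = mse_loss(student["feat"], teacher["feat"])     # pooled-feature alignment
    l_patch = mse_loss(student["patch"], teacher["patch"])  # patch-token alignment
    return lam_token * l_token + lam_feat * l_feat + lam_patch * l_patch

# Toy example: random activations (batch 2, 4 patches, dim 8), teacher offset by 0.1
rng = np.random.default_rng(0)
student = {"cls": rng.normal(size=(2, 8)),
           "feat": rng.normal(size=(2, 8)),
           "patch": rng.normal(size=(2, 4, 8))}
teacher = {k: v + 0.1 for k, v in student.items()}
print(round(proteus_loss(student, teacher), 4))  # → 0.03 (three MSE terms of 0.01 each)
```

With the paper's choice of λ_token = λ_feat = λ_patch = 1, the objective reduces to a plain sum of the three alignment terms, which is what the default arguments above encode.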