Beyond Random Augmentations: Pretraining with Hard Views
Authors: Fabio Ferreira, Ivo Rapant, Jörg Franke, Frank Hutter
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We are the first to demonstrate hard view pretraining's effectiveness at scale, particularly training on the full ImageNet-1k dataset, and evaluating across multiple SSL methods, ConvNets, and ViTs. As a result, HVP sets a new state-of-the-art on DINO ViT-B/16, reaching 78.8% linear evaluation accuracy (a 0.6% improvement) and consistent gains of 1% for both 100 and 300 epoch pretraining, with similar improvements across transfer tasks in DINO, SimSiam, iBOT, and SimCLR. |
| Researcher Affiliation | Academia | Fabio Ferreira (University of Freiburg), Ivo Rapant (University of Freiburg), Jörg K.H. Franke (University of Freiburg), Frank Hutter (ELLIS Institute Tübingen & University of Freiburg) |
| Pseudocode | Yes | Algorithm 1 Pretraining with Hard Views |
| Open Source Code | Yes | We make our PyTorch (Paszke et al., 2019) code, models, and all used hyperparameters publicly available under https://github.com/automl/hvp. |
| Open Datasets | Yes | To the best of our knowledge, we are the first to demonstrate the effectiveness of a hard view sampling strategy at scale, particularly on modern architectures like Vision Transformers (ViTs) and training on the full ImageNet dataset. Table 2: HVP compares favorably against models trained without it when fine-tuned (F.T.) to or linearly evaluated (Lin.) on other datasets (averaged over 3 seeds; 100-ep. pretraining). In Table 2, we apply both the linear evaluation (Lin.) and finetuning (F.T.) protocols to our models across a diverse set of datasets consisting of CIFAR10 (Krizhevsky, 2009), CIFAR100, Flowers102 (Nilsback & Zisserman, 2008), Food101 (Bossard et al., 2014), and iNaturalist 2021 (iNaturalist 2021 competition dataset). For object detection and instance segmentation, we use the COCO (Lin et al., 2014) dataset with Cascade Mask R-CNN (Cai & Vasconcelos, 2019; He et al., 2017). |
| Dataset Splits | Yes | We report the top-1 validation accuracy on frozen features, as well as the k-NN classifier performance, in Table 1. In Table 2, we apply both the linear evaluation (Lin.) and finetuning (F.T.) protocols to our models across a diverse set of datasets consisting of CIFAR10 (Krizhevsky, 2009), CIFAR100, Flowers102 (Nilsback & Zisserman, 2008), Food101 (Bossard et al., 2014), and iNaturalist 2021 (iNaturalist 2021 competition dataset). |
| Hardware Specification | Yes | We primarily ran our experiments with 8x NVIDIA GeForce RTX 2080 Ti nodes, with which the pretraining and linear evaluation duration ranged from 3.5 to 25 days. (Appendix J, Computational Overhead of HVP) The details of hardware and software used for this analysis are: one single compute node with 8 NVIDIA RTX 2080 Ti, AMD EPYC 7502 (32-Core Processor), 512GB RAM, Ubuntu 22.04.3 LTS, PyTorch 2.0.1, CUDA 11.8. |
| Software Dependencies | Yes | The details of hardware and software used for this analysis are: one single compute node with 8 NVIDIA RTX 2080 Ti, AMD EPYC 7502 (32-Core Processor), 512GB RAM, Ubuntu 22.04.3 LTS, PyTorch 2.0.1, CUDA 11.8. |
| Experiment Setup | Yes | For DINO, we additionally compare ResNet-50 (He et al., 2016) against the ViT-S/16 (Dosovitskiy et al., 2020) architecture. Table 8: Pretraining ImageNet hyperparameters for the runs with DINO ViT-S/16. For 300 epochs, we use a batch size of 1024. Table 9: Pretraining ImageNet hyperparameters for the runs with DINO ViT-B/16. Table 10: Pretraining ImageNet hyperparameters for the runs with SimSiam. For 300 epochs, we use a batch size of 1024. Table 11: Pretraining ImageNet hyperparameters for the runs with SimCLR. Table 12: Finetuning hyperparameters for DINO ViT-S/16. Table 13: Finetuning hyperparameters for SimSiam and ResNet-50. Table 14: Hyperparameters for object detection and instance segmentation on COCO. |
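The "Pseudocode" row cites Algorithm 1 (Pretraining with Hard Views). The core selection logic — sample several candidate view pairs, score each with the SSL loss, and take the gradient step only on the hardest pair — can be sketched in a few lines. The sketch below is a hypothetical toy illustration, not the paper's implementation: the "encoder" is a single scalar weight, the SSL loss is a mean squared distance, the augmentation is additive noise, and the gradient is taken by finite differences; the actual method plugs the same selection step into real SSL losses (DINO, SimSiam, iBOT, SimCLR) and networks.

```python
import random

# Hypothetical toy stand-ins for illustration only: a scalar-weight "encoder",
# an MSE "SSL loss", and additive-noise "augmentation".
def encode(w, view):
    return [w * v for v in view]

def ssl_loss(z1, z2):
    # Mean squared distance between the two view embeddings.
    return sum((a - b) ** 2 for a, b in zip(z1, z2)) / len(z1)

def augment(x, rng):
    # Placeholder for the random crops / color distortions used in practice.
    return [v + 0.1 * rng.gauss(0, 1) for v in x]

def hvp_step(w, x, rng, num_candidates=4, lr=0.01, eps=1e-4):
    """One hard-view pretraining step: sample candidate view pairs, keep the
    pair with the highest SSL loss, and update the model on that pair only
    (here via a finite-difference gradient on the scalar weight w)."""
    pairs = [(augment(x, rng), augment(x, rng)) for _ in range(num_candidates)]
    losses = [ssl_loss(encode(w, v1), encode(w, v2)) for v1, v2 in pairs]
    hardest = max(range(num_candidates), key=losses.__getitem__)  # hardest pair
    v1, v2 = pairs[hardest]
    grad = (ssl_loss(encode(w + eps, v1), encode(w + eps, v2))
            - ssl_loss(encode(w - eps, v1), encode(w - eps, v2))) / (2 * eps)
    return w - lr * grad, losses[hardest]

rng = random.Random(0)
w, hard_loss = hvp_step(1.0, [0.5, -0.2, 0.3], rng)
```

Because only the forward pass is needed to score candidates, the extra cost over standard pretraining is a few gradient-free forward passes per step, which matches the overhead analysis referenced in the Hardware row.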