Beyond Random Augmentations: Pretraining with Hard Views

Authors: Fabio Ferreira, Ivo Rapant, Jörg Franke, Frank Hutter

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We are the first to demonstrate hard view pretraining's effectiveness at scale, particularly training on the full ImageNet-1k dataset, and evaluating across multiple SSL methods, ConvNets, and ViTs. As a result, HVP sets a new state of the art on DINO ViT-B/16, reaching 78.8% linear evaluation accuracy (a 0.6% improvement) and consistent gains of 1% for both 100- and 300-epoch pretraining, with similar improvements across transfer tasks in DINO, SimSiam, iBOT, and SimCLR.
Researcher Affiliation | Academia | Fabio Ferreira (University of Freiburg), Ivo Rapant (University of Freiburg), Jörg K.H. Franke (University of Freiburg), Frank Hutter (ELLIS Institute Tübingen & University of Freiburg)
Pseudocode | Yes | Algorithm 1: Pretraining with Hard Views
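Algorithm 1 itself is not reproduced in this report. As a hedged illustration of its core idea, the selection step of hard view pretraining can be sketched in pure Python: sample several candidate view pairs and keep the pair on which the current model incurs the highest loss. The function names and the candidate count below are illustrative; in the paper, the pairs are augmented image crops and `loss_fn` is the SSL objective (e.g., the DINO or SimCLR loss).

```python
def sample_view_pairs(image, augment, num_candidates):
    """Draw candidate pairs of augmented views from one image (illustrative)."""
    return [(augment(image), augment(image)) for _ in range(num_candidates)]

def select_hard_view(candidate_pairs, loss_fn):
    """HVP selection step: keep the candidate pair with the highest SSL loss,
    i.e., the 'hardest' view pair for the current model state."""
    return max(candidate_pairs, key=lambda pair: loss_fn(*pair))

# Toy usage: with absolute difference as a stand-in loss, the most
# dissimilar pair is selected as the hard view.
pairs = [(1, 2), (0, 5), (3, 3)]
hard_pair = select_hard_view(pairs, lambda a, b: abs(a - b))  # (0, 5)
```

The model is then trained on the selected pair only, so the extra cost is the forward passes needed to score the candidates.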
Open Source Code | Yes | We make our PyTorch (Paszke et al., 2019) code, models, and all used hyperparameters publicly available under https://github.com/automl/hvp.
Open Datasets | Yes | To the best of our knowledge, we are the first to demonstrate the effectiveness of a hard view sampling strategy at scale, particularly on modern architectures like Vision Transformers (ViTs) and training on the full ImageNet dataset. Table 2: HVP compares favorably against models trained without it when fine-tuned (F.T.) or linearly evaluated (Lin.) on other datasets (averaged over 3 seeds; 100-epoch pretraining). In Table 2, we apply both the linear evaluation (Lin.) and finetuning (F.T.) protocols to our models across a diverse set of datasets consisting of CIFAR10 (Krizhevsky, 2009), CIFAR100, Flowers102 (Nilsback & Zisserman, 2008), Food101 (Bossard et al., 2014), and iNaturalist 2021 (iNaturalist 2021 competition dataset). For object detection and instance segmentation, we use the COCO (Lin et al., 2014) dataset with Cascade Mask R-CNN (Cai & Vasconcelos, 2019; He et al., 2017).
Dataset Splits | Yes | We report the top-1 validation accuracy on frozen features, as well as the k-NN classifier performance, in Table 1. In Table 2, we apply both the linear evaluation (Lin.) and finetuning (F.T.) protocols to our models across a diverse set of datasets consisting of CIFAR10 (Krizhevsky, 2009), CIFAR100, Flowers102 (Nilsback & Zisserman, 2008), Food101 (Bossard et al., 2014), and iNaturalist 2021 (iNaturalist 2021 competition dataset).
Hardware Specification | Yes | We primarily ran our experiments with 8x NVIDIA GeForce RTX 2080 Ti nodes, with which the pretraining and linear evaluation duration ranged from 3.5 to 25 days. (Appendix J, Computational Overhead of HVP:) The details of hardware and software used for this analysis are: one single compute node with 8 NVIDIA RTX 2080 Ti, AMD EPYC 7502 (32-core processor), 512GB RAM, Ubuntu 22.04.3 LTS, PyTorch 2.0.1, CUDA 11.8.
Software Dependencies | Yes | The details of hardware and software used for this analysis are: one single compute node with 8 NVIDIA RTX 2080 Ti, AMD EPYC 7502 (32-core processor), 512GB RAM, Ubuntu 22.04.3 LTS, PyTorch 2.0.1, CUDA 11.8.
Experiment Setup | Yes | For DINO, we additionally compare ResNet-50 (He et al., 2016) against the ViT-S/16 (Dosovitskiy et al., 2020) architecture. Table 8: Pretraining ImageNet hyperparameters for the runs with DINO ViT-S/16. For 300 epochs, we use a batch size of 1024. Table 9: Pretraining ImageNet hyperparameters for the runs with DINO ViT-B/16. Table 10: Pretraining ImageNet hyperparameters for the runs with SimSiam. For 300 epochs, we use a batch size of 1024. Table 11: Pretraining ImageNet hyperparameters for the runs with SimCLR. Table 12: Finetuning hyperparameters for DINO ViT-S/16. Table 13: Finetuning hyperparameters for SimSiam and ResNet-50. Table 14: Hyperparameters for object detection and instance segmentation on COCO.
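The per-method hyperparameter tables (Tables 8 to 14) are not reproduced in this summary. As a minimal sketch of the kind of configuration record they describe: only the 300-epoch batch size of 1024 is stated in the source excerpt; the method and backbone names reflect the setups the paper lists, while the field layout is purely illustrative.

```python
# Illustrative pretraining configuration record; only batch_size=1024 for
# 300-epoch runs is stated in the source, the structure is a placeholder.
pretrain_config = {
    "method": "DINO",        # SSL method: DINO, SimSiam, iBOT, or SimCLR
    "backbone": "ViT-S/16",  # compared against ResNet-50 for DINO
    "epochs": 300,
    "batch_size": 1024,      # stated in the source for 300-epoch runs
}
```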