The Evolution of Out-of-Distribution Robustness Throughout Fine-Tuning

Authors: Anders Johan Andreassen, Yasaman Bahri, Behnam Neyshabur, Rebecca Roelofs

TMLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We study datasets with naturally occurring distribution shifts, and we conduct a thorough empirical investigation of effective robustness during fine-tuning and surprisingly find that models pre-trained on larger datasets exhibit effective robustness during training that vanishes at convergence. We study how properties of the data influence effective robustness, and we show that it increases with larger size, more diversity, and higher example difficulty of the dataset.
Researcher Affiliation | Industry | Anders Andreassen (EMAIL, Google Research); Yasaman Bahri (EMAIL, Google Research); Behnam Neyshabur (EMAIL, Google Research); Rebecca Roelofs (rofls@google.com, Google Research)
Pseudocode | No | The paper describes methods using mathematical equations and textual explanations, but it does not contain any clearly labeled pseudocode or algorithm blocks. For example, Section 3 defines Effective Robustness with an equation, and Section 4 details the experimental setup in text.
Open Source Code | No | The paper references third-party code and weights, such as "publicly released pre-trained weights for the ViT-B-32 model from https://github.com/openai/CLIP" and "PyTorch (Paszke et al., 2019) implementation and weights coming from torchvision.models v0.8.2". However, it does not provide any explicit statement or link for the authors' own implementation code for the methodology described in the paper.
Open Datasets | Yes | Fine-tuning datasets. We evaluate ER throughout fine-tuning (see Appendix B.2 for our definition of fine-tuning) on both ImageNet and CIFAR-10, since researchers commonly fine-tune pre-trained models on these popular benchmarks. For CIFAR-10 in particular, there are several widely available pre-trained ImageNet models that we can easily transfer to the CIFAR-10 dataset. Robustness benchmarks. Both CIFAR-10 and ImageNet have several well-established robustness benchmarks. We choose to focus on naturally occurring distribution shifts, rather than synthetic shifts which modify images, because they are more realistic and because there are several robustness interventions that work well on synthetic shifts but do not transfer to natural distribution shifts (Taori et al., 2020). Thus, for CIFAR-10, we evaluate robustness on CIFAR-10.1, which was created in the replication study of Recht et al. (2018) and is one of the few natural distribution shift benchmarks available for it; and for ImageNet, we use the ImageNetV2 (Recht et al., 2019), ObjectNet (Barbu et al., 2019), and ImageNet-R (Hendrycks et al., 2020) test sets.
Dataset Splits | Yes | For ImageNet, we use the publicly available testbed from Recht et al. (2019), and for CIFAR-10 we use the testbed from Recht et al. (2018). When training the full model on the CIFAR-10 training set (with image size 224x224), we additionally apply Random Crop with padding=28 and Random Horizontal Flip. ImageNet. The ImageNet training images are preprocessed with Random Resized Crop to 224 pixels and Random Horizontal Flip, and the test images are resized to 256 pixels and preprocessed with Center Crop to 224 pixels. Using the C-score (Jiang et al., 2020) as a metric for example difficulty, we select 5,000 of the easiest, hardest, or random examples while maintaining class balance (i.e. 500 examples from each class).
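The class-balanced subset selection quoted above (500 examples per class, ranked by example difficulty) can be sketched as follows. This is a minimal illustration, not the authors' code: the array names are made up, and random numbers stand in for the C-score.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: 10 classes, 100 examples each, with a per-example
# difficulty score (the paper uses the C-score; these are random).
labels = np.repeat(np.arange(10), 100)
difficulty = rng.random(labels.shape[0])

def select_balanced(labels, difficulty, per_class, mode="easiest"):
    """Pick `per_class` examples from each class by difficulty rank."""
    chosen = []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        order = np.argsort(difficulty[idx])  # ascending = easiest first
        if mode == "easiest":
            chosen.extend(idx[order[:per_class]])
        elif mode == "hardest":
            chosen.extend(idx[order[-per_class:]])
        else:  # random, but still class-balanced
            chosen.extend(rng.choice(idx, per_class, replace=False))
    return np.asarray(chosen)

subset = select_balanced(labels, difficulty, per_class=50, mode="easiest")
```

Selecting per class rather than globally is what maintains the class balance the paper describes; a global top-k by difficulty would skew the class distribution.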
Hardware Specification | No | The paper does not explicitly describe the hardware used for running its experiments. It mentions using PyTorch models and Big Transfer models but provides no specific details on GPUs, CPUs, or other computing resources.
Software Dependencies | Yes | We also study six ImageNet pre-trained PyTorch (Paszke et al., 2019) models: AlexNet (Krizhevsky, 2014), ResNet-18, -50, and -152 (He et al., 2016a), VGG-11 with Batch Normalization (Simonyan & Zisserman, 2014), and Wide ResNet-50-2 (Zagoruyko & Komodakis, 2016). When fine-tuning pre-trained ImageNet models on CIFAR-10, we always replace the original classification layer with a 10-class randomly initialized classification layer. Our CLIP zero-shot model uses the publicly released pre-trained weights for the ViT-B-32 model from https://github.com/openai/CLIP. Following the methodology from Radford et al. (2021a), we fine-tune the zero-shot CLIP model using a logistic regression classifier optimized with L-BFGS, and we determine the L2 regularization strength λ using a hyperparameter sweep on the validation sets with ten logarithmically-spaced values between 10^-6 and 10^7. All linear fits throughout this work were computed by first transforming to logit space (see Section A) and then fitting using scipy.stats.linregress.
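The logit-space linear fitting quoted above can be sketched with scipy.stats.linregress. The accuracy values below are invented for illustration; only the transform-then-fit recipe comes from the paper:

```python
import numpy as np
from scipy.stats import linregress

def logit(p):
    # Map accuracies in (0, 1) to logit space before fitting,
    # as the paper does for all its linear fits.
    return np.log(p / (1.0 - p))

# Hypothetical in-distribution vs. shifted-test-set accuracies.
id_acc = np.array([0.60, 0.70, 0.80, 0.90])
shift_acc = np.array([0.45, 0.55, 0.68, 0.82])

fit = linregress(logit(id_acc), logit(shift_acc))
# fit.slope and fit.intercept define the baseline accuracy trend;
# effective robustness is then accuracy above this fitted line.
```

Fitting in logit space keeps the fitted line from predicting accuracies outside (0, 1), which a fit on raw accuracies cannot guarantee.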
Experiment Setup | Yes | For all models we ran five runs, and all were trained using stochastic gradient descent with momentum 0.9 and weight decay 10^-4, unless otherwise specified. The default settings for training the full PyTorch models (see Section B.1) were batch size 64 for 250 epochs with learning rates [0.1, 0.01] for the randomly initialized model, and 100 epochs with learning rates [0.01, 0.001, 0.0001] for the pre-trained models. When training the BiT models, we extract the features from the penultimate layer and only train the prediction head (except for the randomly initialized BiT models, where we train the full model). Unless otherwise specified, the models are trained using learning rate 0.001 with batch size 32768 for 100 epochs with no weight decay. Our CLIP zero-shot model uses the publicly released pre-trained weights for the ViT-B-32 model from https://github.com/openai/CLIP. Following the methodology from Radford et al. (2021a), we fine-tune the zero-shot CLIP model using a logistic regression classifier optimized with L-BFGS, and we determine the L2 regularization strength λ using a hyperparameter sweep on the validation sets with ten logarithmically-spaced values between 10^-6 and 10^7.
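The regularization sweep quoted above can be generated in one line; the endpoints 10^-6 and 10^7 are read from the (partly garbled) quote, so treat them as an assumption when reusing this:

```python
import numpy as np

# Ten logarithmically-spaced L2 regularization strengths for the
# CLIP logistic-regression sweep; endpoints assumed from the quote.
lambdas = np.logspace(-6, 7, num=10)
```

Each candidate λ would then be tried on the validation set, keeping the value with the best validation accuracy before refitting the final classifier.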