The Evolution of Out-of-Distribution Robustness Throughout Fine-Tuning

Authors: Anders Johan Andreassen, Yasaman Bahri, Behnam Neyshabur, Rebecca Roelofs

TMLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We study datasets with naturally occurring distribution shifts, and we conduct a thorough empirical investigation of effective robustness during fine-tuning and surprisingly find that models pre-trained on larger datasets exhibit effective robustness during training that vanishes at convergence. We study how properties of the data influence effective robustness, and we show that it increases with larger size, more diversity, and higher example difficulty of the dataset.
Researcher Affiliation | Industry | Anders Andreassen (EMAIL, Google Research); Yasaman Bahri (EMAIL, Google Research); Behnam Neyshabur (EMAIL, Google Research); Rebecca Roelofs (rofls@google.com, Google Research)
Pseudocode | No | The paper describes methods using mathematical equations and textual explanations, but it does not contain any clearly labeled pseudocode or algorithm blocks. For example, Section 3 defines Effective Robustness with an equation, and Section 4 details the experimental setup in text.
Open Source Code | No | The paper references third-party code and weights, such as "publicly released pre-trained weights for the ViT-B-32 model from https://github.com/openai/CLIP" and "PyTorch (Paszke et al., 2019) implementation and weights coming from torchvision.models v0.8.2". However, it does not provide any explicit statement or link for the authors' own implementation code for the methodology described in the paper.
Open Datasets | Yes | Fine-tuning datasets. We evaluate ER throughout fine-tuning (see Appendix B.2 for our definition of fine-tuning) on both ImageNet and CIFAR-10, since researchers commonly fine-tune pre-trained models on these popular benchmarks. For CIFAR-10 in particular, there are several widely available pre-trained ImageNet models that we can easily transfer to the CIFAR-10 dataset. Robustness benchmarks. Both CIFAR-10 and ImageNet have several well-established robustness benchmarks. We choose to focus on naturally occurring distribution shifts, rather than synthetic shifts which modify images, because they are more realistic and because there are several robustness interventions that work well on synthetic shifts but do not transfer to natural distribution shifts (Taori et al., 2020). Thus, for CIFAR-10, we evaluate robustness on CIFAR-10.1, which was created in the replication study of Recht et al. (2018) and is one of the few natural distribution shift benchmarks available for it; and for ImageNet, we use the ImageNetV2 (Recht et al., 2019), ObjectNet (Barbu et al., 2019), and ImageNet-R (Hendrycks et al., 2020) test sets.
Dataset Splits | Yes | For ImageNet, we use the publicly available testbed from Recht et al. (2019), and for CIFAR-10 we use the testbed from Recht et al. (2018). When training the full model on the CIFAR-10 training set (with image size 224x224), we additionally apply Random Crop with padding=28 and Random Horizontal Flip. ImageNet. The ImageNet training images are preprocessed with Random Resized Crop to 224 pixels and Random Horizontal Flip, and the test images are resized to 256 pixels and preprocessed with Center Crop to 224 pixels. Using the C-score (Jiang et al., 2020) as a metric for example difficulty, we select 5,000 of the easiest, hardest, or random examples while maintaining class balance (i.e. 500 examples from each class).
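The class-balanced subset selection quoted above (500 examples per class, ranked by example difficulty) can be sketched as follows. This is a minimal illustration, not the authors' code: the array names are made up, and random numbers stand in for the C-score.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: 10 classes, 100 examples each, with a per-example
# difficulty score (the paper uses the C-score; these are random).
labels = np.repeat(np.arange(10), 100)
difficulty = rng.random(labels.shape[0])

def select_balanced(labels, difficulty, per_class, mode="easiest"):
    """Pick `per_class` examples from each class by difficulty rank."""
    chosen = []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        order = np.argsort(difficulty[idx])  # ascending = easiest first
        if mode == "easiest":
            chosen.extend(idx[order[:per_class]])
        elif mode == "hardest":
            chosen.extend(idx[order[-per_class:]])
        else:  # random, but still class-balanced
            chosen.extend(rng.choice(idx, per_class, replace=False))
    return np.asarray(chosen)

subset = select_balanced(labels, difficulty, per_class=50, mode="easiest")
```

Selecting per class rather than globally is what maintains the class balance the paper describes; a global top-k by difficulty would skew the class distribution.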
Hardware Specification | No | The paper does not explicitly describe the hardware used for running its experiments. It mentions using PyTorch models and Big Transfer models but provides no specific details on GPUs, CPUs, or other computing resources.
Software Dependencies | Yes | We also study six ImageNet pre-trained PyTorch (Paszke et al., 2019) models: AlexNet (Krizhevsky, 2014), ResNet-18, -50, and -152 (He et al., 2016a), VGG-11 with Batch Normalization (Simonyan & Zisserman, 2014), and Wide ResNet-50-2 (Zagoruyko & Komodakis, 2016). When fine-tuning pre-trained ImageNet models on CIFAR-10, we always replace the original classification layer with a 10-class randomly initialized classification layer. Our CLIP zero-shot model uses the publicly released pre-trained weights for the ViT-B-32 model from https://github.com/openai/CLIP. Following the methodology from Radford et al. (2021a), we fine-tune the zero-shot CLIP model using a logistic regression classifier optimized with L-BFGS, and we determine the L2 regularization strength λ using a hyperparameter sweep on the validation sets with ten logarithmically-spaced values between 10^-6 and 10^7. All linear fits throughout this work were computed by first transforming to logit space (see Section A) and then fitting using scipy.stats.linregress.
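The logit-space linear fitting quoted above can be sketched with scipy.stats.linregress. The accuracy values below are invented for illustration; only the transform-then-fit recipe comes from the paper:

```python
import numpy as np
from scipy.stats import linregress

def logit(p):
    # Map accuracies in (0, 1) to logit space before fitting,
    # as the paper does for all its linear fits.
    return np.log(p / (1.0 - p))

# Hypothetical in-distribution vs. shifted-test-set accuracies.
id_acc = np.array([0.60, 0.70, 0.80, 0.90])
shift_acc = np.array([0.45, 0.55, 0.68, 0.82])

fit = linregress(logit(id_acc), logit(shift_acc))
# fit.slope and fit.intercept define the baseline accuracy trend;
# effective robustness is then accuracy above this fitted line.
```

Fitting in logit space keeps the fitted line from predicting accuracies outside (0, 1), which a fit on raw accuracies cannot guarantee.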
Experiment Setup | Yes | For all models we ran five runs, and all were trained using stochastic gradient descent with momentum 0.9 and weight decay 10^-4, unless otherwise specified. The default settings for training the full PyTorch models (see Section B.1) were batch size 64 for 250 epochs with learning rates [0.1, 0.01] for the randomly initialized model, and 100 epochs with learning rates [0.01, 0.001, 0.0001] for the pre-trained models. When training the BiT models, we extract the features from the penultimate layer and only train the prediction head (except for the randomly initialized BiT models, where we train the full model). Unless otherwise specified, the models are trained using learning rate 0.001 with batch size 32768 for 100 epochs with no weight decay. Our CLIP zero-shot model uses the publicly released pre-trained weights for the ViT-B-32 model from https://github.com/openai/CLIP. Following the methodology from Radford et al. (2021a), we fine-tune the zero-shot CLIP model using a logistic regression classifier optimized with L-BFGS, and we determine the L2 regularization strength λ using a hyperparameter sweep on the validation sets with ten logarithmically-spaced values between 10^-6 and 10^7.
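The regularization sweep quoted above can be generated in one line; the endpoints 10^-6 and 10^7 are read from the (partly garbled) quote, so treat them as an assumption when reusing this:

```python
import numpy as np

# Ten logarithmically-spaced L2 regularization strengths for the
# CLIP logistic-regression sweep; endpoints assumed from the quote.
lambdas = np.logspace(-6, 7, num=10)
```

Each candidate λ would then be tried on the validation set, keeping the value with the best validation accuracy before refitting the final classifier.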