If your data distribution shifts, use self-learning
Authors: Evgenia Rusak, Steffen Schneider, George Pachitariu, Luisa Eck, Peter Vincent Gehler, Oliver Bringmann, Wieland Brendel, Matthias Bethge
TMLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct a wide range of large-scale experiments and show consistent improvements irrespective of the model architecture, the pre-training technique or the type of distribution shift. At the same time, self-learning is simple to use in practice because it does not require knowledge or access to the original training data or scheme, is robust to hyperparameter choices, is straightforward to implement and requires only a few adaptation epochs. This makes self-learning techniques highly attractive for any practitioner who applies machine learning algorithms in the real world. We present state-of-the-art adaptation results on CIFAR10-C (8.5% error), ImageNet-C (22.0% mCE), ImageNet-R (17.4% error) and ImageNet-A (14.8% error), theoretically study the dynamics of self-supervised adaptation methods and propose a new classification dataset (ImageNet-D) which is challenging even with adaptation. |
| Researcher Affiliation | Collaboration | Evgenia Rusak EMAIL University of Tübingen; Steffen Schneider EMAIL University of Tübingen; George Pachitariu EMAIL University of Tübingen; Luisa Eck EMAIL University of Oxford; Peter Gehler EMAIL Amazon Tübingen; Oliver Bringmann EMAIL University of Tübingen; Wieland Brendel EMAIL Max Planck Institute for Intelligent Systems, Tübingen; Matthias Bethge EMAIL University of Tübingen |
| Pseudocode | No | The paper describes the self-learning variants using mathematical equations (1, 2, 3, 4) and descriptive text, for example: Hard Pseudo-Labeling (Lee, 2013; Galstyan & Cohen, 2007). We generate labels using the teacher and train the student on pseudo-labels i using the softmax cross-entropy loss, ℓ_H(x) := −log p_s(i\|x), i = argmax_j p_t(j\|x) (Eq. 1). However, there are no clearly labeled pseudocode or algorithm blocks with structured steps. |
| Open Source Code | Yes | Reproducibility Statement We attempted to make our work as reproducible as possible: We mostly used pre-trained models which are publicly available and we denoted the URL addresses of all used checkpoints; for the checkpoints that were necessary to retrain, we report the GitHub directories with the source code and used an official or verified reference implementation when available. We report all used hyperparameters in the Appendix and released the code. Software and Data Code for reproducing results of this paper is available at https://github.com/bethgelab/robustness. |
| Open Datasets | Yes | Datasets. ImageNet-C (IN-C; Hendrycks & Dietterich, 2019) contains corrupted versions of the 50 000 images in the ImageNet validation set. ... ImageNet-R (IN-R; Hendrycks et al., 2020a) contains 30 000 images with artistic renditions of 200 classes of the ImageNet dataset. ImageNet-A (IN-A; Hendrycks et al., 2019) is composed of 7500 unmodified real-world images... CIFAR10 (Krizhevsky et al., 2009) and STL10 (Coates et al., 2011) are small-scale image recognition datasets... The digit datasets MNIST (Deng, 2012) and MNIST-M (Ganin et al., 2016) both have 60 000 training and 10 000 test images. ... we propose ImageNet-D as a new benchmark, which we analyse in Section 8. ... make IN-D publicly available as an easy to use dataset for this purpose. |
| Dataset Splits | Yes | ImageNet-C (IN-C; Hendrycks & Dietterich, 2019) contains corrupted versions of the 50 000 images in the ImageNet validation set. There are fifteen test and four hold-out corruptions, and there are five severity levels for each corruption. ...CIFAR10 (Krizhevsky et al., 2009) and STL10 (Coates et al., 2011) are small-scale image recognition datasets with 10 classes each, and training sets of 50 000/5000 images and test sets of 10 000/8000 images, respectively. The digit datasets MNIST (Deng, 2012) and MNIST-M (Ganin et al., 2016) both have 60 000 training and 10 000 test images. ...To this end, we optimize hyperparameters for each variant of pseudo-labeling on a hold-out set of IN-C that contains four types of image corruptions (speckle noise, Gaussian blur, saturate and spatter) with five different strengths each, following the procedure suggested in Hendrycks & Dietterich (2019). We refer to the hold-out set of IN-C as our dev set. On the small-scale datasets, we use the hold-out set of CIFAR10-C for hyperparameter tuning. |
| Hardware Specification | Yes | The results in Table 29 show that for a ResNet50 model, higher batch size yields a generally better performance. Table 29: ImageNet-C dev set mCE for various batch sizes with linear learning rate scaling. ... affine adaptation experiments on ResNet50 scale can be run with batch size 128 on a Nvidia V100 GPU (16GB), while only batch size 96 experiments are possible on RTX 2080 GPUs. |
| Software Dependencies | Yes | We use different open source software packages for our experiments, most notably Docker (Merkel, 2014), scipy and numpy (Virtanen et al., 2020), GNU parallel (Tange, 2011), TensorFlow (Abadi et al., 2016), PyTorch (Paszke et al., 2017), timm (Wightman, 2019), Self-ensembling for visual domain adaptation (French et al., 2018), the WILDS benchmark (Koh et al., 2021), and torchvision (Marcel & Rodriguez, 2010). |
| Experiment Setup | Yes | Hyperparameters. The different self-learning variants have a range of hyperparameters such as the learning rate or the stopping criterion. ... For all experiments, we use a batch size of 128. ... We vary the learning rate in the same interval as for the ResNet50 model but scale it down linearly to account for the smaller batch size of 32. ... EfficientNet-L2 ... test the learning rates 4.6·10^−2, 4.6·10^−3, 4.6·10^−4 and 4.6·10^−5. ... Table 22: The best hyperparameters for all models that we found on IN-C. For all models, we fine-tune only the affine batch normalization parameters and use q = 0.8 for RPL. The small batch size for the EfficientNet model is due to hardware limitations. (Table columns: Model, Method, Learning rate, batch size, number of epochs.) |
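The hard pseudo-labeling loss quoted in the Pseudocode row (ℓ_H(x) := −log p_s(i|x), i = argmax_j p_t(j|x)) can be sketched in a few lines of plain Python. This is an illustrative toy implementation for a single sample, not the paper's actual code; the function names and logits are made up for the example.

```python
import math

def softmax(logits):
    # numerically stable softmax over a list of logits
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def hard_pseudo_label_loss(student_logits, teacher_logits):
    """Hard pseudo-labeling (Eq. 1): l_H(x) = -log p_s(i|x), i = argmax_j p_t(j|x)."""
    # the teacher's most confident class becomes the pseudo-label
    i = max(range(len(teacher_logits)), key=lambda j: teacher_logits[j])
    # the student is trained with softmax cross-entropy against that label
    p_s = softmax(student_logits)
    return -math.log(p_s[i])

# toy example: teacher is confident in class 0, student agrees but less sharply
loss = hard_pseudo_label_loss([1.5, 0.2, 0.3], [2.0, 0.1, 0.1])
```

A student that agrees confidently with the teacher's pseudo-label incurs a low loss; a student that puts its mass on another class incurs a high one, which is what drives adaptation during self-learning.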