Traces of Class/Cross-Class Structure Pervade Deep Learning Spectra
Authors: Vardan Papyan
JMLR 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Numerous researchers recently applied empirical spectral analysis to the study of modern deep learning classifiers. We identify and discuss an important formal class/cross-class structure and show how it lies at the origin of the many visually striking features observed in deep neural network spectra, some of which were reported in recent articles, while others are unveiled here for the first time. These include spectral outliers ("spikes") and small but distinct continuous distributions ("bumps") often seen beyond the edge of a main bulk. ... In this section, we confirm their reports, this time at the full scale of modern state-of-the-art networks trained on real natural images. We release software implementing state-of-the-art tools in numerical linear algebra, which allows one to approximate efficiently the spectrum of the Hessian of modern deep neural networks such as VGG and ResNet. |
| Researcher Affiliation | Academia | Vardan Papyan EMAIL Department of Statistics Stanford University Stanford, CA 94305, USA |
| Pseudocode | Yes | Algorithm 1: Slow Lanczos(A) ... Algorithm 2: Fast Lanczos(A, M) ... Algorithm 3: Lanczos Approx Spec (A, M, K, nvec, κ) ... Algorithm 4: Normalization(A, M0, τ) ... Algorithm 5: Subspace Iteration(A, C, T) |
| Open Source Code | Yes | We release software implementing state-of-the-art tools in numerical linear algebra, which allows one to approximate efficiently the spectrum of the Hessian of modern deep neural networks such as VGG and ResNet. We describe its functionality in Appendix C. |
| Open Datasets | Yes | trained on various datasets. The top-C eigenspace was estimated precisely using Low Rank Deflation (built upon the power method) and the rest of the spectrum was approximated using Lanczos Approx Spec. Both are described in Appendix C, where we also show the same plots except without first applying Low Rank Deflation. We observe a clear bulk-and-outliers structure with, arguably, C outliers. ... We present here results from training the VGG11 (Simonyan and Zisserman, 2014) and ResNet18 (He et al., 2016) architectures on the MNIST (LeCun et al., 2010), Fashion MNIST (Xiao et al., 2017), CIFAR10 and CIFAR100 (Krizhevsky and Hinton, 2009) datasets. |
| Dataset Splits | Yes | Figure 3 we plot the spectra of the train and test Hessian of VGG11, an architecture with 28 million parameters, trained on various datasets. The top-C eigenspace was estimated precisely using Low Rank Deflation (built upon the power method) and the rest of the spectrum was approximated using Lanczos Approx Spec. ... For each dataset and network, we repeat the previous experiments on 20 training sample sizes logarithmically spaced in the range [10, 5000]. |
| Hardware Specification | No | Some of the computing for this project was performed on the Sherlock cluster at Stanford University; we thank the Stanford Research Computing Center for providing computational resources and support that enabled our research. Some of this project was also performed on Google Cloud Platform: thanks to Google Cloud Platform Education Grants Program for research credits that supplemented this work. |
| Software Dependencies | No | The methods we employ in this paper including Lanczos and subspace iteration assume deterministic linear operators. As such, we train our networks without preprocessing the input data using random flips or crops. Moreover, we replace dropout layers (Srivastava et al., 2014) with batch normalization ones (Ioffe and Szegedy, 2015) in the VGG architecture. The batch normalization layers are always set to test mode. |
| Experiment Setup | Yes | We use stochastic gradient descent with 0.9 momentum, 5×10⁻⁴ weight decay and 128 batch size. We train for 200 epochs in the case of MNIST and Fashion MNIST and 350 in the case of CIFAR10 and CIFAR100, annealing the initial learning rate by a factor of 10 at 1/3 and 2/3 of the number of epochs. For each dataset and network, we sweep over 100 logarithmically spaced initial learning rates in the range [0.25, 0.0001] and pick the one that results in the best test error in the last epoch. For each dataset and network, we repeat the previous experiments on 20 training sample sizes logarithmically spaced in the range [10, 5000]. We also train an eight-layer multilayer perceptron (MLP) with 2048 neurons in each hidden layer on the same datasets. We use the same hyperparameters, except we train for 350 epochs for all datasets and optimize the initial learning rate over 25 logarithmically spaced values. |
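The Lanczos pseudocode cited in the table (Algorithms 1–3) approximates a spectrum using only matrix-vector products with the operator, never the full Hessian. A minimal sketch of this idea, assuming NumPy and a generic `matvec` callback: the function names, the full-reorthogonalization choice, and the Gaussian smoothing kernel are illustrative assumptions, not the paper's released implementation.

```python
import numpy as np

def lanczos(matvec, n, m, rng=None):
    """Lanczos tridiagonalization of an implicit symmetric n x n operator.

    Only `matvec` products are needed, so the full (Hessian-sized) matrix
    is never formed. Returns the diagonal `alpha` and off-diagonal `beta`
    of the m x m tridiagonal matrix T.
    """
    rng = np.random.default_rng(rng)
    v = rng.standard_normal(n)
    v /= np.linalg.norm(v)
    V = np.zeros((n, m))
    alpha = np.zeros(m)
    beta = np.zeros(m - 1)
    for j in range(m):
        V[:, j] = v
        w = matvec(v)
        alpha[j] = v @ w
        w = w - alpha[j] * v
        if j > 0:
            w -= beta[j - 1] * V[:, j - 1]
        # full reorthogonalization: costly but numerically simple
        w -= V[:, : j + 1] @ (V[:, : j + 1].T @ w)
        if j < m - 1:
            beta[j] = np.linalg.norm(w)
            v = w / beta[j]
    return alpha, beta

def approx_spectrum(alpha, beta, grid, sigma):
    """Smooth the Ritz values of T into a density on `grid`, weighting
    each by the squared first eigenvector component, as in stochastic
    Lanczos quadrature."""
    T = np.diag(alpha) + np.diag(beta, 1) + np.diag(beta, -1)
    ritz, U = np.linalg.eigh(T)
    weights = U[0, :] ** 2  # weights sum to 1
    dens = np.zeros_like(grid)
    for lam, w in zip(ritz, weights):
        dens += w * np.exp(-((grid - lam) ** 2) / (2 * sigma ** 2))
    return dens / (sigma * np.sqrt(2 * np.pi))
```

With `m = n` and full reorthogonalization the Ritz values recover the exact eigenvalues; in practice `m` is kept far smaller than the number of network parameters and the smoothed density is averaged over several random starting vectors.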
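The table also references estimating the top-C eigenspace precisely (Subspace Iteration, Algorithm 5) before deflating it away and running Lanczos on the remainder. A hedged sketch of plain orthogonal (subspace) iteration with a Rayleigh-Ritz step, again assuming only a `matvec` callback; the iteration count and stopping rule here are assumptions of this sketch:

```python
import numpy as np

def subspace_iteration(matvec, n, C, iters=100, rng=None):
    """Estimate the top-C eigenpairs of an implicit symmetric operator.

    Repeatedly applies the operator to a C-dimensional subspace and
    re-orthonormalizes; a final Rayleigh-Ritz step extracts eigenvalue
    estimates. The returned eigenvectors could then be used to deflate
    the outliers before approximating the remaining bulk with Lanczos.
    """
    rng = np.random.default_rng(rng)
    Q, _ = np.linalg.qr(rng.standard_normal((n, C)))
    for _ in range(iters):
        Z = np.column_stack([matvec(Q[:, i]) for i in range(C)])
        Q, _ = np.linalg.qr(Z)
    # Rayleigh-Ritz: eigendecompose the operator projected onto span(Q)
    B = Q.T @ np.column_stack([matvec(Q[:, i]) for i in range(C)])
    evals, W = np.linalg.eigh(B)
    return evals, Q @ W
```

Convergence is governed by the gap between the C-th and (C+1)-th eigenvalues, which matches the paper's observation of C well-separated outliers beyond the bulk.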
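The sweep grids and the annealing schedule quoted in the experiment setup can be written down directly. A small sketch, assuming the standard NumPy log-spacing helpers; the exact epoch boundaries of the factor-of-10 drops at 1/3 and 2/3 of training are an assumption of this sketch:

```python
import numpy as np

# 100 log-spaced initial learning rates in [0.0001, 0.25]
learning_rates = np.logspace(np.log10(0.0001), np.log10(0.25), 100)

# 20 log-spaced training sample sizes in [10, 5000], rounded to integers
sample_sizes = np.unique(
    np.logspace(np.log10(10), np.log10(5000), 20).round().astype(int)
)

def lr_at_epoch(initial_lr, epoch, total_epochs):
    """Step schedule from the text: divide the learning rate by 10
    at 1/3 and again at 2/3 of the total number of epochs."""
    drops = (epoch >= total_epochs // 3) + (epoch >= 2 * total_epochs // 3)
    return initial_lr / (10 ** drops)
```

For the 350-epoch CIFAR runs this gives the initial rate for epochs 0-115, a tenth of it for epochs 116-232, and a hundredth thereafter; the best initial rate is then selected by test error in the final epoch, as the setup describes.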