Traces of Class/Cross-Class Structure Pervade Deep Learning Spectra
Authors: Vardan Papyan
JMLR 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Numerous researchers recently applied empirical spectral analysis to the study of modern deep learning classifiers. We identify and discuss an important formal class/cross-class structure and show how it lies at the origin of the many visually striking features observed in deep neural network spectra, some of which were reported in recent articles, while others are unveiled here for the first time. These include spectral outliers ("spikes") and small but distinct continuous distributions ("bumps") often seen beyond the edge of a main bulk. ... In this section, we confirm their reports, this time at the full scale of modern state-of-the-art networks trained on real natural images. We release software implementing state-of-the-art tools in numerical linear algebra, which allows one to approximate efficiently the spectrum of the Hessian of modern deep neural networks such as VGG and ResNet. |
| Researcher Affiliation | Academia | Vardan Papyan EMAIL Department of Statistics Stanford University Stanford, CA 94305, USA |
| Pseudocode | Yes | Algorithm 1: Slow Lanczos(A) ... Algorithm 2: Fast Lanczos(A, M) ... Algorithm 3: Lanczos Approx Spec (A, M, K, nvec, κ) ... Algorithm 4: Normalization(A, M0, τ) ... Algorithm 5: Subspace Iteration(A, C, T) |
| Open Source Code | Yes | We release software implementing state-of-the-art tools in numerical linear algebra, which allows one to approximate efficiently the spectrum of the Hessian of modern deep neural networks such as VGG and ResNet. We describe its functionality in Appendix C. |
| Open Datasets | Yes | trained on various datasets. The top-C eigenspace was estimated precisely using Low Rank Deflation (built upon the power method) and the rest of the spectrum was approximated using Lanczos Approx Spec. Both are described in Appendix C, where we also show the same plots except without first applying Low Rank Deflation. We observe a clear bulk-and-outliers structure with, arguably, C outliers. ... We present here results from training the VGG11 (Simonyan and Zisserman, 2014) and ResNet18 (He et al., 2016) architectures on the MNIST (LeCun et al., 2010), Fashion MNIST (Xiao et al., 2017), CIFAR10 and CIFAR100 (Krizhevsky and Hinton, 2009) datasets. |
| Dataset Splits | Yes | Figure 3 we plot the spectra of the train and test Hessian of VGG11, an architecture with 28 million parameters, trained on various datasets. The top-C eigenspace was estimated precisely using Low Rank Deflation (built upon the power method) and the rest of the spectrum was approximated using Lanczos Approx Spec. ... For each dataset and network, we repeat the previous experiments on 20 training sample sizes logarithmically spaced in the range [10, 5000]. |
| Hardware Specification | No | Some of the computing for this project was performed on the Sherlock cluster at Stanford University; we thank the Stanford Research Computing Center for providing computational resources and support that enabled our research. Some of this project was also performed on Google Cloud Platform: thanks to Google Cloud Platform Education Grants Program for research credits that supplemented this work. |
| Software Dependencies | No | The methods we employ in this paper including Lanczos and subspace iteration assume deterministic linear operators. As such, we train our networks without preprocessing the input data using random flips or crops. Moreover, we replace dropout layers (Srivastava et al., 2014) with batch normalization ones (Ioffe and Szegedy, 2015) in the VGG architecture. The batch normalization layers are always set to test mode. |
| Experiment Setup | Yes | We use stochastic gradient descent with 0.9 momentum, 5×10⁻⁴ weight decay and 128 batch size. We train for 200 epochs in the case of MNIST and Fashion MNIST and 350 in the case of CIFAR10 and CIFAR100, annealing the initial learning rate by a factor of 10 at 1/3 and 2/3 of the number of epochs. For each dataset and network, we sweep over 100 logarithmically spaced initial learning rates in the range [0.25, 0.0001] and pick the one that results in the best test error in the last epoch. For each dataset and network, we repeat the previous experiments on 20 training sample sizes logarithmically spaced in the range [10, 5000]. We also train an eight-layer multilayer perceptron (MLP) with 2048 neurons in each hidden layer on the same datasets. We use the same hyperparameters, except we train for 350 epochs for all datasets and optimize the initial learning rate over 25 logarithmically spaced values. |
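The Lanczos pseudocode cited in the table (Algorithms 1–3) approximates a spectrum using only matrix-vector products with the operator, never the full Hessian. A minimal sketch of this idea, assuming NumPy and a generic `matvec` callback: the function names, the full-reorthogonalization choice, and the Gaussian smoothing kernel are illustrative assumptions, not the paper's released implementation.

```python
import numpy as np

def lanczos(matvec, n, m, rng=None):
    """Lanczos tridiagonalization of an implicit symmetric n x n operator.

    Only `matvec` products are needed, so the full (Hessian-sized) matrix
    is never formed. Returns the diagonal `alpha` and off-diagonal `beta`
    of the m x m tridiagonal matrix T.
    """
    rng = np.random.default_rng(rng)
    v = rng.standard_normal(n)
    v /= np.linalg.norm(v)
    V = np.zeros((n, m))
    alpha = np.zeros(m)
    beta = np.zeros(m - 1)
    for j in range(m):
        V[:, j] = v
        w = matvec(v)
        alpha[j] = v @ w
        w = w - alpha[j] * v
        if j > 0:
            w -= beta[j - 1] * V[:, j - 1]
        # full reorthogonalization: costly but numerically simple
        w -= V[:, : j + 1] @ (V[:, : j + 1].T @ w)
        if j < m - 1:
            beta[j] = np.linalg.norm(w)
            v = w / beta[j]
    return alpha, beta

def approx_spectrum(alpha, beta, grid, sigma):
    """Smooth the Ritz values of T into a density on `grid`, weighting
    each by the squared first eigenvector component, as in stochastic
    Lanczos quadrature."""
    T = np.diag(alpha) + np.diag(beta, 1) + np.diag(beta, -1)
    ritz, U = np.linalg.eigh(T)
    weights = U[0, :] ** 2  # weights sum to 1
    dens = np.zeros_like(grid)
    for lam, w in zip(ritz, weights):
        dens += w * np.exp(-((grid - lam) ** 2) / (2 * sigma ** 2))
    return dens / (sigma * np.sqrt(2 * np.pi))
```

With `m = n` and full reorthogonalization the Ritz values recover the exact eigenvalues; in practice `m` is kept far smaller than the number of network parameters and the smoothed density is averaged over several random starting vectors.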
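The table also references estimating the top-C eigenspace precisely (Subspace Iteration, Algorithm 5) before deflating it away and running Lanczos on the remainder. A hedged sketch of plain orthogonal (subspace) iteration with a Rayleigh-Ritz step, again assuming only a `matvec` callback; the iteration count and stopping rule here are assumptions of this sketch:

```python
import numpy as np

def subspace_iteration(matvec, n, C, iters=100, rng=None):
    """Estimate the top-C eigenpairs of an implicit symmetric operator.

    Repeatedly applies the operator to a C-dimensional subspace and
    re-orthonormalizes; a final Rayleigh-Ritz step extracts eigenvalue
    estimates. The returned eigenvectors could then be used to deflate
    the outliers before approximating the remaining bulk with Lanczos.
    """
    rng = np.random.default_rng(rng)
    Q, _ = np.linalg.qr(rng.standard_normal((n, C)))
    for _ in range(iters):
        Z = np.column_stack([matvec(Q[:, i]) for i in range(C)])
        Q, _ = np.linalg.qr(Z)
    # Rayleigh-Ritz: eigendecompose the operator projected onto span(Q)
    B = Q.T @ np.column_stack([matvec(Q[:, i]) for i in range(C)])
    evals, W = np.linalg.eigh(B)
    return evals, Q @ W
```

Convergence is governed by the gap between the C-th and (C+1)-th eigenvalues, which matches the paper's observation of C well-separated outliers beyond the bulk.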
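The sweep grids and the annealing schedule quoted in the experiment setup can be written down directly. A small sketch, assuming the standard NumPy log-spacing helpers; the exact epoch boundaries of the factor-of-10 drops at 1/3 and 2/3 of training are an assumption of this sketch:

```python
import numpy as np

# 100 log-spaced initial learning rates in [0.0001, 0.25]
learning_rates = np.logspace(np.log10(0.0001), np.log10(0.25), 100)

# 20 log-spaced training sample sizes in [10, 5000], rounded to integers
sample_sizes = np.unique(
    np.logspace(np.log10(10), np.log10(5000), 20).round().astype(int)
)

def lr_at_epoch(initial_lr, epoch, total_epochs):
    """Step schedule from the text: divide the learning rate by 10
    at 1/3 and again at 2/3 of the total number of epochs."""
    drops = (epoch >= total_epochs // 3) + (epoch >= 2 * total_epochs // 3)
    return initial_lr / (10 ** drops)
```

For the 350-epoch CIFAR runs this gives the initial rate for epochs 0-115, a tenth of it for epochs 116-232, and a hundredth thereafter; the best initial rate is then selected by test error in the final epoch, as the setup describes.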