Are All Layers Created Equal?

Authors: Chiyuan Zhang, Samy Bengio, Yoram Singer

JMLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this paper we focus on layer-wise functional structure and behavior in overparameterized deep models. To do so, we study empirically the layers' robustness to post-training re-initialization and re-randomization of the parameters. We provide experimental results which give evidence for the heterogeneity of layers. Morally, layers of large deep neural networks can be categorized as either robust or critical.
Researcher Affiliation | Industry | Chiyuan Zhang, Google, Mountain View, CA; Samy Bengio, Google, Mountain View, CA; Yoram Singer, Google, Mountain View, CA
Pseudocode | No | The paper describes methods and definitions in prose and mathematical formulations (e.g., Definition 1, Theorem 2) but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper states: "In our experiments, we take the default initialization schemes used in open source deep learning libraries." and references third-party tools like "https://github.com/google/sentencepiece" and "https://github.com/google-research/vision_transformer". However, it does not provide any explicit statement or link for the source code of the methodology described in this paper.
Open Datasets | Yes | The datasets we use in our robustness study are standard image classification benchmarks: MNIST, CIFAR-10, and ImageNet. All networks were trained using SGD with momentum using a piecewise constant learning rate schedule. See Appendix A for further details. ... We train it on the LM1B (Chelba et al., 2013) dataset... Since two variants of ImageNet datasets (Deng et al., 2009) were used when training those models, in this subsection, we will spell out the variants as ImageNet-21k, the larger dataset containing 21,000 classes and 14M training images, and ImageNet-1k, the smaller dataset containing 1,000 classes and 1.2M training images.
Dataset Splits | Yes | Performance of the trained networks is measured in terms of the agreement between its predicted labels and the true labels on a newly observed test set. ... We then measure the robustness of this model on the validation set... We run robustness evaluation on the ImageNet-1k validation set. ... During training, CIFAR-10 images are padded with 4 pixels of zeros on all sides, then randomly flipped (horizontally) and cropped. ImageNet images are randomly cropped during training and center-cropped during testing.
Hardware Specification | No | Batch size of 128 is used, except for ResNets with more than 50 layers on ImageNet, where batch size of 64 is used due to device memory constraints. The paper does not specify any particular GPU models, CPU types, or other hardware details used for the experiments.
Software Dependencies | No | Stochastic Gradient Descent (SGD) with a momentum of 0.9 is used to minimize the multi-class cross entropy loss. ... We train the model with the Adam optimizer (Kingma and Ba, 2015)... We use the SentencePiece tokenizer with a vocabulary size of 30,000. The paper mentions optimizers and a tokenizer but does not provide specific version numbers for any software frameworks or libraries (e.g., Python, TensorFlow, PyTorch, CUDA).
Experiment Setup | Yes | Stochastic Gradient Descent (SGD) with a momentum of 0.9 is used to minimize the multi-class cross entropy loss. Each model is trained for 100 epochs, using a stage-wise constant learning rate schedule with a multiplicative factor of 0.2 at epochs 30, 60, and 90. A batch size of 128 is used, except for ResNets with more than 50 layers on ImageNet, where a batch size of 64 is used due to device memory constraints. We train the model with the Adam optimizer (Kingma and Ba, 2015) for 20 epochs. ... Global mean and standard deviation are computed on all the training pixels and applied to normalize the inputs on each dataset.
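The re-initialization and re-randomization probes quoted under Research Type can be sketched on a toy fully connected network. This is a minimal illustration, not the paper's implementation: the network, the checkpointed parameters, and names like `forward`, `re_initialize`, and `re_randomize` are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def forward(x, layers):
    # Apply each weight matrix, with a ReLU between hidden layers.
    for i, w in enumerate(layers):
        x = x @ w
        if i < len(layers) - 1:
            x = relu(x)
    return x

def accuracy(layers, x, y):
    return float((forward(x, layers).argmax(axis=1) == y).mean())

# Checkpointed parameters: values at initialization and after "training"
# (here simulated with a small perturbation, purely for illustration).
init_layers = [rng.normal(size=(8, 16)), rng.normal(size=(16, 3))]
trained_layers = [w + 0.1 * rng.normal(size=w.shape) for w in init_layers]

def re_initialize(trained, init, k):
    # Re-initialization: reset layer k to its *recorded* initial values.
    probed = [w.copy() for w in trained]
    probed[k] = init[k].copy()
    return probed

def re_randomize(trained, k, rng):
    # Re-randomization: reset layer k to a *fresh* random draw.
    probed = [w.copy() for w in trained]
    probed[k] = rng.normal(size=probed[k].shape)
    return probed

# Probe each layer in turn and record the accuracy drop versus the
# trained model; robust layers show small drops, critical layers large ones.
x = rng.normal(size=(64, 8))
y = rng.integers(0, 3, size=64)
baseline = accuracy(trained_layers, x, y)
drops = [baseline - accuracy(re_initialize(trained_layers, init_layers, k), x, y)
         for k in range(len(trained_layers))]
```

The per-layer `drops` list is the quantity the paper's robustness study inspects: layers whose probe barely changes test performance are "robust", the rest are "critical".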
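The CIFAR-10 augmentation quoted under Dataset Splits (pad with 4 pixels of zeros on all sides, random crop, random horizontal flip) is easy to state in NumPy. A minimal sketch, assuming an HWC image layout; `augment_cifar` is an illustrative name, not from the paper.

```python
import numpy as np

def augment_cifar(img, rng):
    """Pad 4 zeros per side, random 32x32 crop, random horizontal flip."""
    padded = np.pad(img, ((4, 4), (4, 4), (0, 0)))  # 32x32 -> 40x40
    top = int(rng.integers(0, 9))    # 40 - 32 + 1 = 9 valid offsets
    left = int(rng.integers(0, 9))
    crop = padded[top:top + 32, left:left + 32]
    if rng.random() < 0.5:           # flip horizontally with probability 0.5
        crop = crop[:, ::-1]
    return crop
```

At test time no augmentation is applied to CIFAR-10, matching the quoted setup; ImageNet instead uses random crops during training and center crops during testing.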
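The stage-wise schedule in Experiment Setup (multiply the learning rate by 0.2 at epochs 30, 60, and 90 over 100 epochs) and the global-pixel normalization can be written down directly. A sketch under one stated assumption: the base learning rate of 0.1 is a placeholder, since the section does not quote the initial value.

```python
import numpy as np

def learning_rate(epoch, base_lr=0.1, factor=0.2, milestones=(30, 60, 90)):
    # Stage-wise constant schedule: multiply by `factor` at each milestone
    # reached. base_lr=0.1 is an assumed placeholder, not quoted in the paper.
    lr = base_lr
    for m in milestones:
        if epoch >= m:
            lr *= factor
    return lr

def normalize(x, train_pixels):
    # Global mean and standard deviation are computed over *all* training
    # pixels of a dataset, then applied to normalize the inputs.
    mu, sigma = train_pixels.mean(), train_pixels.std()
    return (x - mu) / sigma
```

With these defaults the schedule stays at 0.1 for epochs 0–29, then drops to 0.02, 0.004, and 0.0008 at the three milestones.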