Are All Layers Created Equal?

Authors: Chiyuan Zhang, Samy Bengio, Yoram Singer

JMLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this paper we focus on layer-wise functional structure and behavior in overparameterized deep models. To do so, we study empirically the layers' robustness to post-training re-initialization and re-randomization of the parameters. We provide experimental results which give evidence for the heterogeneity of layers. Morally, layers of large deep neural networks can be categorized as either robust or critical.
Researcher Affiliation | Industry | Chiyuan Zhang, Google, Mountain View, CA; Samy Bengio, Google, Mountain View, CA; Yoram Singer, Google, Mountain View, CA
Pseudocode | No | The paper describes methods and definitions in prose and mathematical formulations (e.g., Definition 1, Theorem 2) but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper states: "In our experiments, we take the default initialization schemes used in open source deep learning libraries." and references third-party tools like "https://github.com/google/sentencepiece" and "https://github.com/google-research/vision_transformer". However, it does not provide any explicit statement or link for the source code of the methodology described in this paper.
Open Datasets | Yes | The datasets we use in our robustness study are standard image classification benchmarks: MNIST, CIFAR-10, and ImageNet. All networks were trained using SGD with momentum using a piecewise constant learning rate schedule. See Appendix A for further details. ... We train it on the LM1B (Chelba et al., 2013) dataset... Since two variants of ImageNet datasets (Deng et al., 2009) were used when training those models, in this subsection, we will spell out the variants as ImageNet-21k, the larger dataset containing 21,000 classes and 14M training images, and ImageNet-1k, the smaller dataset containing 1,000 classes and 1.2M training images.
Dataset Splits | Yes | Performance of the trained networks is measured in terms of the agreement between its predicted labels and the true labels on a newly observed test set. ... We then measure the robustness of this model on the validation set... We run robustness evaluation on the ImageNet-1k validation set. ... During training, CIFAR-10 images are padded with 4 pixels of zeros on all sides, then randomly flipped (horizontally) and cropped. ImageNet images are randomly cropped during training and center-cropped during testing.
Hardware Specification | No | Batch size of 128 is used, except for ResNets with more than 50 layers on ImageNet, where batch size of 64 is used due to device memory constraints. The paper does not specify any particular GPU models, CPU types, or other hardware details used for the experiments.
Software Dependencies | No | Stochastic Gradient Descent (SGD) with a momentum of 0.9 is used to minimize the multi-class cross entropy loss. ... We train the model with the Adam optimizer (Kingma and Ba, 2015)... We use the SentencePiece tokenizer with a vocabulary size of 30,000. The paper mentions optimizers and a tokenizer but does not provide specific version numbers for any software frameworks or libraries (e.g., Python, TensorFlow, PyTorch, CUDA).
Experiment Setup | Yes | Stochastic Gradient Descent (SGD) with a momentum of 0.9 is used to minimize the multi-class cross entropy loss. Each model is trained for 100 epochs, using a stage-wise constant learning rate schedule with a multiplicative factor of 0.2 at epochs 30, 60, and 90. A batch size of 128 is used, except for ResNets with more than 50 layers on ImageNet, where a batch size of 64 is used due to device memory constraints. We train the model with the Adam optimizer (Kingma and Ba, 2015) for 20 epochs. ... Global mean and standard deviation are computed on all the training pixels and applied to normalize the inputs on each dataset.
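The re-initialization and re-randomization probes quoted under Research Type can be sketched on a toy fully connected network. This is a minimal illustration, not the paper's implementation: the network, the checkpointed parameters, and names like `forward`, `re_initialize`, and `re_randomize` are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def forward(x, layers):
    # Apply each weight matrix, with a ReLU between hidden layers.
    for i, w in enumerate(layers):
        x = x @ w
        if i < len(layers) - 1:
            x = relu(x)
    return x

def accuracy(layers, x, y):
    return float((forward(x, layers).argmax(axis=1) == y).mean())

# Checkpointed parameters: values at initialization and after "training"
# (here simulated with a small perturbation, purely for illustration).
init_layers = [rng.normal(size=(8, 16)), rng.normal(size=(16, 3))]
trained_layers = [w + 0.1 * rng.normal(size=w.shape) for w in init_layers]

def re_initialize(trained, init, k):
    # Re-initialization: reset layer k to its *recorded* initial values.
    probed = [w.copy() for w in trained]
    probed[k] = init[k].copy()
    return probed

def re_randomize(trained, k, rng):
    # Re-randomization: reset layer k to a *fresh* random draw.
    probed = [w.copy() for w in trained]
    probed[k] = rng.normal(size=probed[k].shape)
    return probed

# Probe each layer in turn and record the accuracy drop versus the
# trained model; robust layers show small drops, critical layers large ones.
x = rng.normal(size=(64, 8))
y = rng.integers(0, 3, size=64)
baseline = accuracy(trained_layers, x, y)
drops = [baseline - accuracy(re_initialize(trained_layers, init_layers, k), x, y)
         for k in range(len(trained_layers))]
```

The per-layer `drops` list is the quantity the paper's robustness study inspects: layers whose probe barely changes test performance are "robust", the rest are "critical".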
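The CIFAR-10 augmentation quoted under Dataset Splits (pad with 4 pixels of zeros on all sides, random crop, random horizontal flip) is easy to state in NumPy. A minimal sketch, assuming an HWC image layout; `augment_cifar` is an illustrative name, not from the paper.

```python
import numpy as np

def augment_cifar(img, rng):
    """Pad 4 zeros per side, random 32x32 crop, random horizontal flip."""
    padded = np.pad(img, ((4, 4), (4, 4), (0, 0)))  # 32x32 -> 40x40
    top = int(rng.integers(0, 9))    # 40 - 32 + 1 = 9 valid offsets
    left = int(rng.integers(0, 9))
    crop = padded[top:top + 32, left:left + 32]
    if rng.random() < 0.5:           # flip horizontally with probability 0.5
        crop = crop[:, ::-1]
    return crop
```

At test time no augmentation is applied to CIFAR-10, matching the quoted setup; ImageNet instead uses random crops during training and center crops during testing.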
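The stage-wise schedule in Experiment Setup (multiply the learning rate by 0.2 at epochs 30, 60, and 90 over 100 epochs) and the global-pixel normalization can be written down directly. A sketch under one stated assumption: the base learning rate of 0.1 is a placeholder, since the section does not quote the initial value.

```python
import numpy as np

def learning_rate(epoch, base_lr=0.1, factor=0.2, milestones=(30, 60, 90)):
    # Stage-wise constant schedule: multiply by `factor` at each milestone
    # reached. base_lr=0.1 is an assumed placeholder, not quoted in the paper.
    lr = base_lr
    for m in milestones:
        if epoch >= m:
            lr *= factor
    return lr

def normalize(x, train_pixels):
    # Global mean and standard deviation are computed over *all* training
    # pixels of a dataset, then applied to normalize the inputs.
    mu, sigma = train_pixels.mean(), train_pixels.std()
    return (x - mu) / sigma
```

With these defaults the schedule stays at 0.1 for epochs 0–29, then drops to 0.02, 0.004, and 0.0008 at the three milestones.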