Initialization of ReLUs for Dynamical Isometry
Authors: Rebekka Burkholz, Alina Dubatovka
NeurIPS 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We train fully-connected ReLU feedforward networks of different depth consisting of L = 1, ..., 10 hidden layers with the same number of neurons N_l = N = 100, 300, 500 and an additional softmax classification layer on MNIST [10] and CIFAR-10 [9] to compare three different initialization schemes: the standard He initialization and our two proposals in Sec. 3, i.e., GSM and orthogonal weights. (See the initialization sketch below the table.) |
| Researcher Affiliation | Academia | Rebekka Burkholz, Department of Biostatistics, Harvard T.H. Chan School of Public Health, 655 Huntington Avenue, Boston, MA 02115, EMAIL; Alina Dubatovka, Department of Computer Science, ETH Zurich, Universitätstrasse 6, 8092 Zurich, EMAIL |
| Pseudocode | No | The paper does not contain any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide any explicit statement about releasing source code for the described methodology or a link to a code repository. |
| Open Datasets | Yes | We train fully-connected ReLU feedforward networks of different depth... on MNIST [10] and CIFAR-10 [9] |
| Dataset Splits | No | The paper uses MNIST and CIFAR-10 datasets but does not explicitly provide details about training/validation/test dataset splits, specific percentages, or how samples were divided for reproducibility. |
| Hardware Specification | Yes | Each experiment on MNIST was run on 1 Nvidia GTX 1080 Ti GPU, while each experiment on CIFAR-10 was performed on 4 Nvidia GTX 1080 Ti GPUs. |
| Software Dependencies | No | The paper does not specify the version numbers for any software dependencies, libraries, or frameworks used in the experiments. |
| Experiment Setup | Yes | We train fully-connected ReLU feedforward networks of different depth consisting of L = 1, ..., 10 hidden layers with the same number of neurons N_l = N = 100, 300, 500 and an additional softmax classification layer... We focus on minimizing the cross-entropy by Stochastic Gradient Descent (SGD) without batch normalization or any data augmentation techniques... we adapt the learning rate to (0.0001 + 0.003 exp(−step/10^4))/L for MNIST and (0.00001 + 0.0005 exp(−step/10^4))/L for CIFAR-10 for 10^4 SGD steps with a batch size of 100 in all cases. (See the training sketch below the table.) |
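
The architecture quoted in the Research Type row is simple enough to sketch. Below is a minimal, hypothetical PyTorch reconstruction (the paper does not name its framework, so PyTorch is an assumption): `build_mlp` is an assumed helper name, the `orthogonal` branch uses PyTorch's plain scaled orthogonal initializer as a stand-in for the paper's Sec. 3 construction, and the GSM scheme is left as a stub because its definition is not quoted in this report.

```python
# Hypothetical sketch of the compared network and initializations (assumed PyTorch).
import torch.nn as nn

def build_mlp(depth: int, width: int = 100, in_dim: int = 784,
              n_classes: int = 10, scheme: str = "he") -> nn.Sequential:
    """Fully-connected ReLU network with `depth` hidden layers of `width` neurons
    and a final classification layer (softmax is folded into the loss)."""
    dims = [in_dim] + [width] * depth + [n_classes]
    layers = []
    for i in range(len(dims) - 1):
        lin = nn.Linear(dims[i], dims[i + 1])
        if scheme == "he":
            # Standard He initialization: variance 2 / fan_in.
            nn.init.kaiming_normal_(lin.weight, nonlinearity="relu")
        elif scheme == "orthogonal":
            # Stand-in for the paper's orthogonal proposal: scaled orthogonal
            # weights; gain sqrt(2) compensates for ReLU halving the variance.
            nn.init.orthogonal_(lin.weight, gain=2 ** 0.5)
        else:
            # GSM is defined in Sec. 3 of the paper and not reproduced here.
            raise NotImplementedError("GSM construction not quoted in this report")
        nn.init.zeros_(lin.bias)
        layers.append(lin)
        if i < len(dims) - 2:  # ReLU after hidden layers only
            layers.append(nn.ReLU())
    return nn.Sequential(*layers)
```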
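Likewise, a minimal training sketch matching the quoted protocol, assuming PyTorch and an MNIST data loader with batch size 100; `train` and `mnist_lr` are assumed names, and only the MNIST schedule is shown (the CIFAR-10 schedule swaps in the constants quoted above).

```python
# Minimal sketch of the quoted protocol: plain SGD, no batch norm or augmentation,
# 10^4 steps at batch size 100, depth-scaled exponentially decaying learning rate.
import math
import torch
import torch.nn.functional as F

def mnist_lr(step: int, depth: int) -> float:
    # (0.0001 + 0.003 * exp(-step / 10^4)) / L, the quoted MNIST schedule.
    return (0.0001 + 0.003 * math.exp(-step / 1e4)) / depth

def train(model, loader, depth: int, n_steps: int = 10_000, device: str = "cpu"):
    # The paper ran on Nvidia GTX 1080 Ti GPUs; "cpu" is a safe default here.
    model.to(device)
    opt = torch.optim.SGD(model.parameters(), lr=mnist_lr(0, depth))
    step = 0
    while step < n_steps:
        for x, y in loader:  # loader assumed to yield batches of 100
            for group in opt.param_groups:
                group["lr"] = mnist_lr(step, depth)  # refresh schedule each step
            logits = model(x.view(x.size(0), -1).to(device))
            loss = F.cross_entropy(logits, y.to(device))  # softmax + NLL
            opt.zero_grad()
            loss.backward()
            opt.step()
            step += 1
            if step >= n_steps:
                break
```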