ZerO Initialization: Initializing Neural Networks with only Zeros and Ones
Authors: Jiawei Zhao, Florian Tobias Schaefer, Anima Anandkumar
TMLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through both theoretical and empirical studies, we demonstrate that ZerO is able to train networks without damaging their expressivity. Applying ZerO on ResNet achieves state-of-the-art performance on various datasets, including ImageNet, which suggests random weights may be unnecessary for network initialization. In this section, we empirically benchmark ZerO on CIFAR-10 and ImageNet datasets, where we evaluate ResNet-18 on CIFAR-10 and ResNet-50 on ImageNet (Krizhevsky, 2009; Deng et al., 2009). As shown in Table 2, ZerO achieves state-of-the-art accuracy on both datasets compared to other random methods. |
| Researcher Affiliation | Collaboration | Jiawei Zhao (California Institute of Technology); Florian Schäfer (Georgia Institute of Technology); Anima Anandkumar (California Institute of Technology, NVIDIA) |
| Pseudocode | Yes | Algorithm 1 ZerO Initialization. Input: a neural network F with L matrices W_l ∈ ℝ^(P_l×Q_l) for l in 1, ..., L. I\* is the partial identity matrix defined in Definition 1. H_m is the Hadamard matrix defined in Definition 2. For l in 1, ..., L: If P_l = Q_l: W_l ← I ▷ Identity mapping. If P_l < Q_l: W_l ← I\* ▷ Propagate the first P_l dimensions. If P_l > Q_l: W_l ← c I\* H_m I\*, where m = ⌈log₂(P_l)⌉ and c = 2^(−(m−1)/2) ▷ Apply Hadamard matrix. Algorithm 2 ZerO Initialization on Convolution. Input: number of input channels c_in, number of output channels c_out, odd kernel size k. Return: a c_out × c_in × k × k convolutional kernel K. Let n ← ⌊k/2⌋. If c_out = c_in: K[:, :, n, n] ← I. If c_out < c_in: K[:, :, n, n] ← I\*. If c_out > c_in: K[:, :, n, n] ← c I\* H_m I\*, where m = ⌈log₂(c_out)⌉ and c = 2^(−(m−1)/2) |
| Open Source Code | Yes | Code repository: https://github.com/jiaweizzhao/ZerO-initialization |
| Open Datasets | Yes | In this section, we empirically benchmark ZerO on CIFAR-10 and ImageNet datasets, where we evaluate ResNet-18 on CIFAR-10 and ResNet-50 on ImageNet (Krizhevsky, 2009; Deng et al., 2009). We also apply ZerO to Transformer and evaluate it on the WikiText-2 dataset (Vaswani et al., 2017). |
| Dataset Splits | Yes | We benchmark ZerO on CIFAR-10 and ImageNet datasets, where we evaluate ResNet-18 on CIFAR-10 and ResNet-50 on ImageNet (Krizhevsky, 2009; Deng et al., 2009). Both ResNet structures follow the design from He et al. (2016), which includes batch normalization by default. We warm up the learning rate with 5 and 10 epochs for ImageNet and CIFAR-10, respectively. |
| Hardware Specification | No | We are grateful to the anonymous reviewers for their helpful comments and NVIDIA for the computational support. |
| Software Dependencies | No | The paper does not provide specific software dependencies with version numbers. |
| Experiment Setup | Yes | Hyperparameter settings. We find that ZerO can fully utilize the default hyperparameters, which include a learning rate of 0.1, a momentum of 0.9, and a weight decay of 0.0001. In addition, we observe that learning-rate warmup is essential for ZerO to achieve a large maximal learning rate, as most of the weights start at exactly zero. We warm up the learning rate with 5 and 10 epochs for ImageNet and CIFAR-10, respectively. |
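The two algorithms quoted in the Pseudocode row can be sketched in NumPy as follows. This is a minimal illustration under my own naming, not the authors' released code: `hadamard` builds H_m by Sylvester's construction, `partial_identity` is the paper's I\*, and `zero_init_conv` places the ZerO matrix on the center tap of an otherwise all-zero kernel.

```python
import numpy as np

def hadamard(m: int) -> np.ndarray:
    """2^m x 2^m Hadamard matrix via Sylvester's construction."""
    H = np.array([[1.0]])
    for _ in range(m):
        H = np.block([[H, H], [H, -H]])
    return H

def partial_identity(p: int, q: int) -> np.ndarray:
    """Partial identity I*: ones on the main diagonal, zeros elsewhere."""
    I = np.zeros((p, q))
    np.fill_diagonal(I, 1.0)
    return I

def zero_init(p: int, q: int) -> np.ndarray:
    """ZerO init for a p x q weight matrix (Algorithm 1)."""
    if p == q:
        return np.eye(p)          # identity mapping
    if p < q:
        return partial_identity(p, q)  # propagate the first p dimensions
    # p > q: apply a scaled Hadamard matrix, padded to the next power of two
    m = int(np.ceil(np.log2(p)))
    c = 2.0 ** (-(m - 1) / 2)
    return c * partial_identity(p, 2**m) @ hadamard(m) @ partial_identity(2**m, q)

def zero_init_conv(c_out: int, c_in: int, k: int) -> np.ndarray:
    """ZerO init for a c_out x c_in x k x k conv kernel (Algorithm 2).

    Only the center tap K[:, :, n, n] with n = floor(k/2) is nonzero,
    so at initialization the convolution acts like the 1x1 map above.
    """
    K = np.zeros((c_out, c_in, k, k))
    n = k // 2
    K[:, :, n, n] = zero_init(c_out, c_in)
    return K
```

For example, `zero_init(2, 4)` returns the 2×4 partial identity, while `zero_init(4, 2)` returns the first two columns of the 4×4 Hadamard matrix scaled by 2^(−1/2).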