ZerO Initialization: Initializing Neural Networks with only Zeros and Ones

Authors: Jiawei Zhao, Florian Tobias Schaefer, Anima Anandkumar

TMLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through both theoretical and empirical studies, we demonstrate that ZerO is able to train networks without damaging their expressivity. Applying ZerO on ResNet achieves state-of-the-art performance on various datasets, including ImageNet, which suggests random weights may be unnecessary for network initialization. In this section, we empirically benchmark ZerO on CIFAR-10 and ImageNet datasets, where we evaluate ResNet-18 on CIFAR-10 and ResNet-50 on ImageNet (Krizhevsky, 2009; Deng et al., 2009). As shown in Table 2, ZerO achieves state-of-the-art accuracy on both datasets compared to other random methods.
Researcher Affiliation | Collaboration | Jiawei Zhao (California Institute of Technology); Florian Schäfer (Georgia Institute of Technology); Anima Anandkumar (California Institute of Technology and NVIDIA)
Pseudocode | Yes | Algorithm 1 ZerO Initialization. Input: a neural network F with L matrices W_l ∈ R^(P_l × Q_l) for l in 1, ..., L. I* is the partial identity matrix defined in Definition 1. H_m is the Hadamard matrix defined in Definition 2. For l in 1, ..., L: If P_l = Q_l: W_l ← I (identity mapping). If P_l < Q_l: W_l ← I* (propagate the first P_l dimensions). If P_l > Q_l: W_l ← c I* H_m I*, where m = ⌈log2(P_l)⌉ and c = 2^(-(m-1)/2) (apply Hadamard matrix). Algorithm 2 ZerO Initialization on Convolution. Input: number of input channels c_in, number of output channels c_out, odd kernel size k. Return: a c_out × c_in × k × k convolutional kernel K. Let n ← ⌊k/2⌋. If c_out = c_in: K[:, :, n, n] ← I. If c_out < c_in: K[:, :, n, n] ← I*. If c_out > c_in: K[:, :, n, n] ← c I* H_m I*, where m = ⌈log2(c_out)⌉ and c = 2^(-(m-1)/2).
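The pseudocode above can be sketched in NumPy. This is a minimal illustration of Algorithms 1 and 2, assuming the standard Sylvester construction of the Hadamard matrix and the top-left-corner definition of the partial identity I*; it is not the authors' reference implementation.

```python
import numpy as np

def hadamard(m):
    """Sylvester construction of the 2**m x 2**m Hadamard matrix H_m."""
    H = np.ones((1, 1))
    for _ in range(m):
        H = np.block([[H, H], [H, -H]])
    return H

def partial_identity(p, q):
    """Partial identity I*: an identity block in the top-left corner, zeros elsewhere."""
    I = np.zeros((p, q))
    d = min(p, q)
    I[:d, :d] = np.eye(d)
    return I

def zero_init(p, q):
    """ZerO init for a p x q weight matrix (Algorithm 1 sketch)."""
    if p == q:
        return np.eye(p)          # identity mapping
    if p < q:
        return partial_identity(p, q)  # propagate the first p dimensions
    # p > q: W = c * I*(p x 2^m) @ H_m @ I*(2^m x q)
    m = int(np.ceil(np.log2(p)))
    c = 2.0 ** (-(m - 1) / 2)
    return c * partial_identity(p, 2 ** m) @ hadamard(m) @ partial_identity(2 ** m, q)

def zero_init_conv(c_out, c_in, k):
    """ZerO init for a c_out x c_in x k x k conv kernel (Algorithm 2 sketch)."""
    K = np.zeros((c_out, c_in, k, k))
    n = k // 2                    # centre tap of an odd-sized kernel
    K[:, :, n, n] = zero_init(c_out, c_in)
    return K
```

Note that every initialized entry is a zero, a one, or a fixed scaled constant c, so no randomness enters the initialization.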
Open Source Code | Yes | Code repository: https://github.com/jiaweizzhao/ZerO-initialization
Open Datasets | Yes | In this section, we empirically benchmark ZerO on CIFAR-10 and ImageNet datasets, where we evaluate ResNet-18 on CIFAR-10 and ResNet-50 on ImageNet (Krizhevsky, 2009; Deng et al., 2009). We also apply ZerO to Transformer and evaluate it on the WikiText-2 dataset (Vaswani et al., 2017).
Dataset Splits | Yes | We benchmark ZerO on CIFAR-10 and ImageNet datasets, where we evaluate ResNet-18 on CIFAR-10 and ResNet-50 on ImageNet (Krizhevsky, 2009; Deng et al., 2009). Both ResNet structures follow the design from He et al. (2016), which includes batch normalization by default. We warm up the learning rate with 5 and 10 epochs for ImageNet and CIFAR-10, respectively.
Hardware Specification | No | We are grateful to the anonymous reviewers for their helpful comments and NVIDIA for the computational support.
Software Dependencies | No | The paper does not provide specific software dependencies with version numbers.
Experiment Setup | Yes | Hyperparameter settings. We find that ZerO can fully utilize the default hyperparameters, which include a learning rate of 0.1, a momentum of 0.9, and a weight decay of 0.0001. In addition, we observe the learning rate warmup is essential for ZerO to achieve a large maximal learning rate, as most of the weights start from the exact zero. We warm up the learning rate with 5 and 10 epochs for ImageNet and CIFAR-10, respectively.
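The warmup described above can be sketched as a simple schedule. This assumes a linear ramp to the base learning rate of 0.1; the excerpt states only the warmup lengths (5 epochs for ImageNet, 10 for CIFAR-10), not the exact ramp shape or any later decay, so treat this as an illustrative sketch.

```python
def warmup_lr(epoch, warmup_epochs, base_lr=0.1):
    """Linearly ramp the learning rate up to base_lr over warmup_epochs,
    then hold it constant (any later decay schedule is outside this excerpt)."""
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs
    return base_lr

# e.g. ImageNet: warmup_lr(epoch, 5); CIFAR-10: warmup_lr(epoch, 10)
```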