ZerO Initialization: Initializing Neural Networks with only Zeros and Ones

Authors: Jiawei Zhao, Florian Tobias Schaefer, Anima Anandkumar

TMLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through both theoretical and empirical studies, we demonstrate that ZerO is able to train networks without damaging their expressivity. Applying ZerO on ResNet achieves state-of-the-art performance on various datasets, including ImageNet, which suggests random weights may be unnecessary for network initialization. In this section, we empirically benchmark ZerO on CIFAR-10 and ImageNet datasets, where we evaluate ResNet-18 on CIFAR-10 and ResNet-50 on ImageNet (Krizhevsky, 2009; Deng et al., 2009). As shown in Table 2, ZerO achieves state-of-the-art accuracy on both datasets compared to other random methods.
Researcher Affiliation | Collaboration | Jiawei Zhao (California Institute of Technology); Florian Schäfer (Georgia Institute of Technology); Anima Anandkumar (California Institute of Technology and NVIDIA)
Pseudocode | Yes | Algorithm 1 ZerO Initialization. Input: a neural network F with L matrices W_l ∈ R^(P_l × Q_l) for l in 1, ..., L. I* is the partial identity matrix defined in Definition 1. H_m is the Hadamard matrix defined in Definition 2. For l in 1, ..., L: If P_l = Q_l: W_l ← I (identity mapping). If P_l < Q_l: W_l ← I* (propagate the first P_l dimensions). If P_l > Q_l: W_l ← c I* H_m I*, where m = ⌈log2(P_l)⌉ and c = 2^(-(m-1)/2) (apply Hadamard matrix). Algorithm 2 ZerO Initialization on Convolution. Input: number of input channels c_in, number of output channels c_out, odd kernel size k. Return: a c_out × c_in × k × k convolutional kernel K. Let n ← ⌊k/2⌋. If c_out = c_in: K[:, :, n, n] ← I. If c_out < c_in: K[:, :, n, n] ← I*. If c_out > c_in: K[:, :, n, n] ← c I* H_m I*, where m = ⌈log2(c_out)⌉ and c = 2^(-(m-1)/2).
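The pseudocode above can be sketched in NumPy. This is a minimal illustration of Algorithms 1 and 2, assuming the standard Sylvester construction of the Hadamard matrix and the top-left-corner definition of the partial identity I*; it is not the authors' reference implementation.

```python
import numpy as np

def hadamard(m):
    """Sylvester construction of the 2**m x 2**m Hadamard matrix H_m."""
    H = np.ones((1, 1))
    for _ in range(m):
        H = np.block([[H, H], [H, -H]])
    return H

def partial_identity(p, q):
    """Partial identity I*: an identity block in the top-left corner, zeros elsewhere."""
    I = np.zeros((p, q))
    d = min(p, q)
    I[:d, :d] = np.eye(d)
    return I

def zero_init(p, q):
    """ZerO init for a p x q weight matrix (Algorithm 1 sketch)."""
    if p == q:
        return np.eye(p)          # identity mapping
    if p < q:
        return partial_identity(p, q)  # propagate the first p dimensions
    # p > q: W = c * I*(p x 2^m) @ H_m @ I*(2^m x q)
    m = int(np.ceil(np.log2(p)))
    c = 2.0 ** (-(m - 1) / 2)
    return c * partial_identity(p, 2 ** m) @ hadamard(m) @ partial_identity(2 ** m, q)

def zero_init_conv(c_out, c_in, k):
    """ZerO init for a c_out x c_in x k x k conv kernel (Algorithm 2 sketch)."""
    K = np.zeros((c_out, c_in, k, k))
    n = k // 2                    # centre tap of an odd-sized kernel
    K[:, :, n, n] = zero_init(c_out, c_in)
    return K
```

Note that every initialized entry is a zero, a one, or a fixed scaled constant c, so no randomness enters the initialization.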
Open Source Code | Yes | Code repository: https://github.com/jiaweizzhao/ZerO-initialization
Open Datasets | Yes | In this section, we empirically benchmark ZerO on CIFAR-10 and ImageNet datasets, where we evaluate ResNet-18 on CIFAR-10 and ResNet-50 on ImageNet (Krizhevsky, 2009; Deng et al., 2009). We also apply ZerO to Transformer and evaluate it on the WikiText-2 dataset (Vaswani et al., 2017).
Dataset Splits | Yes | We benchmark ZerO on CIFAR-10 and ImageNet datasets, where we evaluate ResNet-18 on CIFAR-10 and ResNet-50 on ImageNet (Krizhevsky, 2009; Deng et al., 2009). Both ResNet structures follow the design from He et al. (2016), which includes batch normalization by default. We warm up the learning rate with 5 and 10 epochs for ImageNet and CIFAR-10, respectively.
Hardware Specification | No | We are grateful to the anonymous reviewers for their helpful comments and NVIDIA for the computational support.
Software Dependencies | No | The paper does not provide specific software dependencies with version numbers.
Experiment Setup | Yes | Hyperparameter settings. We find that ZerO can fully utilize the default hyperparameters, which include a learning rate of 0.1, a momentum of 0.9, and a weight decay of 0.0001. In addition, we observe the learning rate warmup is essential for ZerO to achieve a large maximal learning rate, as most of the weights start from the exact zero. We warm up the learning rate with 5 and 10 epochs for ImageNet and CIFAR-10, respectively.
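The warmup described above can be sketched as a simple schedule. This assumes a linear ramp to the base learning rate of 0.1; the excerpt states only the warmup lengths (5 epochs for ImageNet, 10 for CIFAR-10), not the exact ramp shape or any later decay, so treat this as an illustrative sketch.

```python
def warmup_lr(epoch, warmup_epochs, base_lr=0.1):
    """Linearly ramp the learning rate up to base_lr over warmup_epochs,
    then hold it constant (any later decay schedule is outside this excerpt)."""
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs
    return base_lr

# e.g. ImageNet: warmup_lr(epoch, 5); CIFAR-10: warmup_lr(epoch, 10)
```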