Batch Normalization Preconditioning for Neural Network Training
Authors: Susanna Lange, Kyle Helfrich, Qiang Ye
JMLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finally, we provide experimental results in Section 5. ... In this section, we compare BNP with several baseline methods on several architectures for image classification tasks. We also present some exploratory experiments to study computational timing comparison as well as improved condition numbers. |
| Researcher Affiliation | Academia | Susanna Lange EMAIL Department of Mathematics, University of Kentucky Lexington, KY 40506 Kyle Helfrich EMAIL Department of Mathematics, University of Dayton Dayton, OH 45469 Qiang Ye EMAIL Department of Mathematics, University of Kentucky Lexington, KY 40506 |
| Pseudocode | Yes | Algorithm 1 Batch Normalization Bβ,γ(h) ... Algorithm 2 One Step of BNP Training on W (ℓ), b(ℓ) of the ℓth Dense Layer ... Algorithm 3 One Step of BNP Training of a Convolution Layer with weight w and bias b |
| Open Source Code | No | The paper does not explicitly state that source code for the described methodology is publicly available, nor does it provide a link to a code repository. |
| Open Datasets | Yes | Data sets: We use MNIST, CIFAR10, CIFAR100, and ImageNet data sets. The MNIST data set (LeCun et al., 2013) ... The CIFAR10 and CIFAR100 data sets (Krizhevsky et al., 2009) ... The ImageNet data set (Russakovsky et al., 2015) |
| Dataset Splits | Yes | The MNIST data set (LeCun et al., 2013) consists of 70,000 black and white images of handwritten digits ranging from 0 to 9. Each image is 28 by 28 pixels. There are 60,000 training images and 10,000 testing images. The CIFAR10 and CIFAR100 data sets (Krizhevsky et al., 2009) consist of 60,000 color images of 32 by 32 pixels with 50,000 training images and 10,000 testing images. ... The ImageNet data set (Russakovsky et al., 2015) consists of 1,431,167 color images with 1,281,167 training images, 50,000 validation images, and 100,000 testing images. |
| Hardware Specification | Yes | These performance time experiments are computed on NVIDIA Tesla V100-SXM2-32GB. |
| Software Dependencies | Yes | Experiments were run using PyTorch and TensorFlow versions 1.13.1 and 2.4.1. |
| Experiment Setup | Yes | Default hyperparameters for BN and optimizers as implemented in TensorFlow or PyTorch are used, as appropriate. For BNP, the default values ϵ1 = 10^-2, ϵ2 = 10^-4, and ρ = 0.99 are also used. ... Each model is trained using SGD. ... We implement all parameter settings suggested in He et al. (2016a) for BN, with the exception that Preactivation ResNet-110 for CIFAR-100 follows the learning rate decay suggested in Han et al. (2016). These include weight regularization of 1E-4 and a learning rate warmup with initial learning rate 0.01 increasing to 0.1 after 400 iterations. For networks with GN, we follow Wu and He (2018) and replace all BN layers with GN. We use group size 4. For BNP+GN with CIFAR10, we use weight regularization of 1.5E-4, the He-Normal weight initialization scaled by 0.1, and group size 4 in GN. For BNP+GN with CIFAR100, we use weight regularization of 2E-4, the He-Normal weight initialization scaled by 0.4, and group size 4. GN and BNP+GN use a linear warmup schedule, with initial learning rate 0.01 increasing to 0.1 over 1 or 2 epochs, tuned for each network. For the ResNet-18 experiment with ImageNet, we follow the settings of Krizhevsky et al. (2012). All images are cropped to 224 × 224 pixels from each image or its horizontal flip (Krizhevsky et al., 2012). All models use a momentum optimizer with momentum 0.9 and weight regularization 1E-4 (except BNP+GN, which uses 8.5E-4), use a mini-batch size of 256, and train on 1 GPU. All models use an initial learning rate of 0.1, which is divided by 10 at 30, 60, and 90 epochs. Both GN and BNP+GN use group size 32. The best learning rates for all models are listed in Table 3. |
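The Pseudocode row above cites the paper's Algorithm 1, the batch normalization transform Bβ,γ(h). As a point of reference, a minimal NumPy sketch of the standard per-feature BN forward pass might look like the following; the ϵ value and array shapes here are illustrative assumptions, and this is plain BN, not the paper's BNP preconditioning update (Algorithms 2-3):

```python
import numpy as np

def batch_norm(h, beta, gamma, eps=1e-5):
    """Sketch of batch normalization B_{beta,gamma}(h) over a mini-batch.

    h: (batch, features) activations; beta, gamma: (features,) shift/scale.
    eps is an illustrative stability constant, not a value from the paper.
    """
    mu = h.mean(axis=0)                    # per-feature batch mean
    var = h.var(axis=0)                    # per-feature batch variance
    h_hat = (h - mu) / np.sqrt(var + eps)  # normalize to ~zero mean, unit variance
    return gamma * h_hat + beta            # learnable scale and shift
```

With gamma = 1 and beta = 0 the output of each feature column has (approximately) zero mean and unit variance over the batch, which is the normalization the pseudocode relies on.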
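The ImageNet setup quoted in the Experiment Setup row describes a stepwise schedule: initial learning rate 0.1, divided by 10 at epochs 30, 60, and 90. A small sketch of that schedule (function name and defaults are ours, chosen to match the quoted numbers):

```python
def step_lr(epoch, base_lr=0.1, milestones=(30, 60, 90), factor=0.1):
    """Stepwise learning rate decay as described in the reported setup:
    start at base_lr and multiply by `factor` at each milestone epoch."""
    lr = base_lr
    for m in milestones:
        if epoch >= m:  # every milestone already passed applies one decay
            lr *= factor
    return lr
```

This is equivalent to PyTorch's `torch.optim.lr_scheduler.MultiStepLR` with `milestones=[30, 60, 90]` and `gamma=0.1`.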