Kernel Normalized Convolutional Networks
Authors: Reza Nasirigerdeh, Reihaneh Torkzadehmahani, Daniel Rueckert, Georgios Kaissis
TMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through extensive experiments, we illustrate that KNConvNets achieve higher or competitive performance compared to the BatchNorm counterparts in image classification and semantic segmentation. They also significantly outperform their batch-independent competitors, including those based on layer and group normalization, in non-private and differentially private training. According to the experimental results (Section 4), KNResNets deliver significantly higher accuracy than the BatchNorm counterparts in image classification on CIFAR-100 using a small batch size. KNResNets, moreover, achieve higher or competitive performance compared to the batch-normalized ResNets in classification on ImageNet and semantic segmentation on Cityscapes. |
| Researcher Affiliation | Academia | Reza Nasirigerdeh (EMAIL), Technical University of Munich; Helmholtz Munich. Reihaneh Torkzadehmahani (EMAIL), Technical University of Munich. Daniel Rueckert (EMAIL), Technical University of Munich; Imperial College London. Georgios Kaissis (EMAIL), Technical University of Munich; Helmholtz Munich. |
| Pseudocode | Yes | Algorithm 1: Computationally-efficient KNConv layer. Input: input tensor X, number of input channels ch_in, number of output channels ch_out, kernel size (kh, kw), stride (sh, sw), padding (ph, pw), bias flag, dropout probability p, and epsilon ϵ. // 2-dimensional convolutional layer: conv_layer = Conv2d(in_channels=ch_in, out_channels=ch_out, kernel_size=(kh, kw), stride=(sh, sw), padding=(ph, pw), bias=false); // convolutional layer output: conv_out = conv_layer(input=X); // mean and variance from KernelNorm: µ, σ² = kn_mean_var(input=X, kernel_size=(kh, kw), stride=(sh, sw), padding=(ph, pw), dropout_p=p); // KNConv output: kn_conv_out = (conv_out − µ · Σ conv_layer.weights) / √(σ² + ϵ); // apply bias: if bias then kn_conv_out += conv_layer.bias. Output: kn_conv_out |
| Open Source Code | Yes | The code is available at: https://github.com/reza-nasirigerdeh/norm-torch |
| Open Datasets | Yes | Our last contribution is to draw performance comparisons among KNResNets and the competitors using several benchmark datasets including CIFAR-100 (Krizhevsky et al., 2009), ImageNet (Deng et al., 2009), and Cityscapes (Cordts et al., 2016). Dataset. The CIFAR-100 dataset consists of 50000 train and 10000 test samples of shape 32×32 from 100 classes. Dataset. The ImageNet dataset contains around 1.28 million training and 50000 validation images. Dataset. The Cityscapes dataset contains 2975 train and 500 validation images from 30 classes, 19 of which are employed for evaluation. Dataset. ImageNet32×32 is the down-sampled version of ImageNet, where all images are resized to 32×32. |
| Dataset Splits | Yes | Dataset. The CIFAR-100 dataset consists of 50000 train and 10000 test samples of shape 32×32 from 100 classes. We adopt the data preprocessing and augmentation scheme widely used for the dataset (Huang et al., 2017a; He et al., 2016b;a): horizontally flipping and randomly cropping the samples after padding them. The cropping and padding sizes are 32×32 and 4×4, respectively. Dataset. The ImageNet dataset contains around 1.28 million training and 50000 validation images. Following the data preprocessing and augmentation scheme from TorchVision (2023a), the train images are horizontally flipped and randomly cropped to 224×224. The test images are first resized to 256×256, and then center cropped to 224×224. Dataset. The Cityscapes dataset contains 2975 train and 500 validation images from 30 classes, 19 of which are employed for evaluation. Following Sun et al. (2019); Ortiz et al. (2020), the train samples are randomly cropped from 2048×1024 to 1024×512, horizontally flipped, and randomly scaled in the range of [0.5, 2.0]. The models are tested on the validation images, which are of shape 2048×1024. |
| Hardware Specification | Yes | The experiments are conducted with 8 NVIDIA A40 GPUs with batch size of 32 per GPU; m: minutes, s: seconds. The experiments are conducted with a single NVIDIA RTX A6000 GPU with batch size of 32; GB: Gigabytes. |
| Software Dependencies | Yes | In terms of implementation, KernelNorm employs the unfolding operation in PyTorch (2023b) to implement the sliding window mechanism in the kn_mean_var function in Algorithm 1. Moreover, it uses the var_mean function in PyTorch (2023c) to compute the mean and variance over the unfolded tensor along the channel, width, and height dimensions. We adopt the original implementation of ResNet-18/34/50 from PyTorch (Paszke et al., 2019), and the PreactResNet-18/34/50 (He et al., 2016b) implementation from Kuang (2021). We follow the experimental setting from Wu & He (2018) and use the multi-GPU training script from TorchVision (2023a) to train KNResNets and the competitors. Our differentially private training is based on DP-SGD (Abadi et al., 2016) from the Opacus library (Yousefpour et al., 2021) with ε=8.0 and δ=8×10⁻⁷. We explore the effectiveness of KernelNorm on the ConvNeXt architecture (Liu et al., 2022) in addition to ResNets. ConvNeXt is a convolutional architecture, but it is heavily inspired by vision transformers (Dosovitskiy et al., 2020), where it uses linear (fully-connected) layers extensively and employs LayerNorm as the normalization layer instead of BatchNorm. To draw the comparison, we train the original ConvNeXt-Tiny model from PyTorch and the corresponding kernel normalized version (both with around 28.5M parameters) on ImageNet using the training recipe and code from TorchVision (2023b) (more experimental details in Appendix B). |
| Experiment Setup | Yes | Training. The models are trained for 150 epochs using the cosine annealing scheduler (Loshchilov & Hutter, 2017) with learning rate decay of 0.01. The optimizer is SGD with momentum of 0.9 and weight decay of 0.0005. For learning rate tuning, we run a given experiment with initial learning rate of 0.2, divide it by 2, and re-run the experiment. We continue this procedure until finding the best learning rate (Table 5 in Appendix B). Then, we repeat the experiment three times, and report the mean and SD over the runs. Training. We follow the experimental setting from Wu & He (2018) and use the multi-GPU training script from TorchVision (2023a) to train KNResNets and the competitors. We train all models for 100 epochs with total batch size of 256 (8 GPUs with batch size of 32 per GPU) using learning rate of 0.1, which is divided by 10 at epochs 30, 60, and 90. The optimizer is SGD with momentum of 0.9 and weight decay of 0.0001. Training. Following Sun et al. (2019); Ortiz et al. (2020), we train the models with learning rate of 0.01, which is gradually decayed by power of 0.9. The models are trained for 500 epochs using 2 GPUs with batch size of 8 per GPU. The optimizer is SGD with momentum of 0.9 and weight decay of 0.0005. Training. We train KNResNet-18 as well as the GroupNorm and LayerNorm counterparts for 100 epochs using the SGD optimizer with zero momentum and zero weight decay, where the learning rate is decayed by factor of 2 at epochs 70 and 90. Note that BatchNorm is inapplicable to differential privacy. All models use the Mish activation (Misra, 2019). For parameter tuning, we consider learning rate values of {2.0, 3.0, 4.0}, clipping values of {1.0, 2.0}, and batch sizes of {2048, 4096, 8192}. We observe that learning rate of 4.0, clipping value of 2.0, and batch size of 8192 achieve the best performance for all models. Our differentially private training is based on DP-SGD (Abadi et al., 2016) from the Opacus library (Yousefpour et al., 2021) with ε=8.0 and δ=8×10⁻⁷. The privacy accountant is RDP (Mironov, 2017). |
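The efficiency trick behind Algorithm 1 is that normalizing each input window by its own mean and variance before convolving is algebraically equivalent to running the plain convolution once and correcting it with µ·ΣW, since Σ W·(x − µ)/s = (Σ W·x − µ·Σ W)/s. A minimal NumPy sketch of this identity (a stand-in for the paper's PyTorch implementation; stride 1, no padding, and no dropout are simplifying assumptions):

```python
import numpy as np

def kn_conv2d(X, W, eps=1e-5):
    """Computationally-efficient KNConv sketch (stride 1, no padding, no dropout).
    X: input of shape (C, H, W); W: weights of shape (C_out, C, kh, kw).
    Each output position is the convolution of the window normalized by the
    window's own mean/variance, computed as (conv_out - mu*sum(W)) / sqrt(var + eps)."""
    C, H, Wd = X.shape
    Co, _, kh, kw = W.shape
    Ho, Wo = H - kh + 1, Wd - kw + 1
    w_sum = W.sum(axis=(1, 2, 3))                  # sum of each filter's weights, shape (Co,)
    out = np.empty((Co, Ho, Wo))
    for i in range(Ho):
        for j in range(Wo):
            win = X[:, i:i + kh, j:j + kw]         # sliding window over all input channels
            mu, var = win.mean(), win.var()        # kernel-norm statistics of this window
            conv = np.tensordot(W, win, axes=([1, 2, 3], [0, 1, 2]))  # plain conv, shape (Co,)
            out[:, i, j] = (conv - mu * w_sum) / np.sqrt(var + eps)
    return out
```

The paper's actual implementation vectorizes the window statistics with PyTorch's unfold and var_mean rather than looping, but the correction formula is the same.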
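The learning-rate tuning procedure quoted in the Experiment Setup row (start at 0.2, divide by 2, re-run until the best rate is found) can be sketched as a simple halving search; `run_experiment` is a hypothetical callback that trains once and returns an accuracy, not part of the paper's code:

```python
def tune_lr(run_experiment, start=0.2, max_halvings=8):
    """Halving search over learning rates, as described in the paper's setup:
    run with lr, divide by 2, re-run, and stop once the result stops improving.
    run_experiment(lr) is a hypothetical callback returning a scalar accuracy."""
    best_lr, best_acc = None, float("-inf")
    lr = start
    for _ in range(max_halvings):
        acc = run_experiment(lr)       # one full training run at this learning rate
        if acc <= best_acc:
            break                      # the previous rate was the best; stop halving
        best_lr, best_acc = lr, acc
        lr /= 2
    return best_lr
```

In the paper the chosen rate is then re-run three times to report mean and SD.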
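The differentially private setup (DP-SGD via Opacus with ε=8.0, δ=8×10⁻⁷, RDP accountant, clipping value 2.0) might be wired up roughly as follows. This is a hedged configuration sketch, not the paper's script: the toy model and data loader are placeholders, and the exact Opacus API can differ across versions:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from opacus import PrivacyEngine

# toy stand-ins for the real model/data (the paper trains KNResNet-18 on CIFAR-100)
model = nn.Sequential(nn.Flatten(), nn.Linear(32 * 32 * 3, 100))
train_loader = DataLoader(TensorDataset(torch.randn(64, 3, 32, 32),
                                        torch.randint(0, 100, (64,))), batch_size=32)
# SGD with zero momentum and zero weight decay, lr = 4.0, as in the paper's DP setup
optimizer = torch.optim.SGD(model.parameters(), lr=4.0, momentum=0.0, weight_decay=0.0)

privacy_engine = PrivacyEngine(accountant="rdp")   # RDP accountant (Mironov, 2017)
model, optimizer, train_loader = privacy_engine.make_private_with_epsilon(
    module=model, optimizer=optimizer, data_loader=train_loader,
    target_epsilon=8.0, target_delta=8e-7,         # ε = 8.0, δ = 8×10⁻⁷
    epochs=100, max_grad_norm=2.0,                 # 100 epochs, clipping value 2.0
)
```

Opacus then clips per-sample gradients and adds calibrated noise inside the wrapped optimizer, which is why batch-dependent layers such as BatchNorm are inapplicable here.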