Improving the Trainability of Deep Neural Networks through Layerwise Batch-Entropy Regularization

Authors: David Peer, Bart Keulen, Sebastian Stabinger, Justus Piater, Antonio Rodríguez-Sánchez

TMLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We prove empirically and theoretically that a positive batch-entropy is required for gradient descent-based training approaches to optimize a given loss function successfully. Based on those insights, we introduce batch-entropy regularization to enable gradient descent-based training algorithms to optimize the flow of information through each hidden layer individually. We show empirically that we can therefore train a "vanilla" fully connected network and convolutional neural network... The effect of batch-entropy regularization is not only evaluated on vanilla neural networks, but also on residual networks, autoencoders, and also transformer models over a wide range of computer vision as well as natural language processing tasks.
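To make the quoted idea concrete, the sketch below estimates a layer's batch entropy from per-neuron activation statistics across a batch under a Gaussian assumption, and turns a collapse in that entropy into a penalty. This is an illustrative reconstruction, not the paper's exact estimator; the function names and the Gaussian entropy formula are assumptions.

```python
import numpy as np

def batch_entropy(activations, eps=1e-8):
    """Estimate batch entropy of one layer from a (batch, neurons) activation
    matrix. Assumes each neuron's activations are roughly Gaussian across the
    batch, so H = 0.5 * log(2*pi*e*var) per neuron, averaged over neurons.
    (Sketch only; the paper's estimator may differ.)"""
    var = activations.var(axis=0) + eps         # per-neuron variance across the batch
    h = 0.5 * np.log(2 * np.pi * np.e * var)    # Gaussian differential entropy
    return h.mean()

def lbe_penalty(per_layer_activations, alpha=0.1):
    """One simple layerwise regularizer: penalize layers whose batch entropy
    collapses toward zero, so that information keeps flowing through every
    hidden layer during gradient-based training."""
    return alpha * sum(-batch_entropy(a) for a in per_layer_activations)
```

A layer whose activations all collapse to (nearly) the same value per neuron has very low batch entropy and thus a large penalty, which matches the paper's motivation that a positive batch entropy is needed for training to succeed.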
Researcher Affiliation | Collaboration | David Peer (EMAIL, DeepOpinion and University of Innsbruck, Austria); Bart Keulen (EMAIL, University of Innsbruck, Austria); Sebastian Stabinger (EMAIL, DeepOpinion and University of Innsbruck, Austria); Justus Piater (EMAIL, University of Innsbruck, Austria); Antonio Rodríguez-Sánchez (EMAIL, University of Innsbruck, Austria)
Pseudocode | No | The paper does not contain explicitly labeled pseudocode or algorithm blocks. Section 3.1 describes notation and mathematical formulations, but these are not presented as structured algorithms.
Open Source Code | Yes | Our code base is available at https://github.com/peerdavid/layerwise-batch-entropy. ... Our source code is publicly available on GitHub, including sweep files for each experiment which describe precise hyperparameter ranges that we used in order to reproduce all experiments.
Open Datasets | Yes | In total we used 8 different datasets for evaluation, including computer vision as well as natural language processing tasks: MNIST (LeCun & Cortes, 2010), Fashion-MNIST (Xiao et al., 2017), CIFAR10 and CIFAR100 (Krizhevsky et al., 2009), RTE (Bentivogli et al., 2009), MRPC (Dolan & Brockett, 2005) and CoLA (Warstadt et al., 2018). To also evaluate the proposed approach on a more challenging task, while still using only moderate computational resources, we trained a model on a subset of ImageNet (100 classes) but used only 500 images per class to increase the complexity of the task (Deng et al., 2009). We call this dataset ImageNet100 in this paper.
Dataset Splits | Yes | We split the training set into training (80%) and validation (20%) and use the validation set to find a good hyperparameter setup for each method. The performance results, comparing models that are trained with/without LBE, are provided on the test set to ensure that we do not overfit the validation set through hyperparameter search.
Hardware Specification | No | All experiments are implemented in PyTorch (Paszke et al., 2017) and executed on Nvidia GPUs. While it mentions "Nvidia GPUs", the paper does not specify GPU models or provide further hardware details.
Software Dependencies | No | All experiments are implemented in PyTorch (Paszke et al., 2017) and executed on Nvidia GPUs. Wandb was used for experiment tracking (Biewald, 2020). The paper mentions PyTorch and Wandb but does not provide specific version numbers for these software dependencies.
Experiment Setup | Yes | Section 4.3: All networks contain 1000 neurons per layer that are trained for 100 epochs, with a batch size of 512, using the Adam optimizer (Kingma & Ba, 2015). To find a good setup for the baseline, we optimized for the learning rate lr ∈ [1e-4, 5e-4, 1e-3] utilizing grid-search. Section 4.4: All networks were trained for 50 epochs with a batch size of 128. The Adam (Kingma & Ba, 2015) optimizer with a waterfall schedule was used. An initial learning rate of 1e-3 was used and every time the validation error stopped decreasing...the learning rate was scaled down by a factor of 0.5. Section 4.5: We used a learning rate of 0.1 at the beginning of the training and divide it by 10 whenever it plateaus, together with a weight decay of 0.0001 and a momentum of 0.9. The network is trained on four different datasets...using SGD with a mini-batch size of 256. Section 4.6: ...fine-tuned networks with a batch-size of 32 and a learning rate that we found through a grid-search [1e-5, 3e-5, 5e-5].
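The "waterfall" schedule quoted from Section 4.4 can be sketched as a plateau-based decay: halve the learning rate whenever the validation error stops decreasing. The class and parameter names below are hypothetical; the authors' implementation (and their patience criterion) may differ.

```python
class WaterfallSchedule:
    """Sketch of a waterfall learning-rate schedule: scale the LR down by
    `factor` whenever the validation error has not improved for `patience`
    consecutive epochs. Parameter names are assumptions."""

    def __init__(self, lr=1e-3, factor=0.5, patience=1):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.best = float("inf")   # best validation error seen so far
        self.bad_epochs = 0        # epochs without improvement

    def step(self, val_error):
        """Call once per epoch with the current validation error; returns the LR."""
        if val_error < self.best:
            self.best = val_error
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs >= self.patience:
                self.lr *= self.factor
                self.bad_epochs = 0
        return self.lr
```

This mirrors PyTorch's built-in `ReduceLROnPlateau` scheduler, which the authors could equally have used for the behavior described.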