Variational Stochastic Gradient Descent for Deep Neural Networks

Authors: Anna Kuzina, Haotian Chen, Babak Esmaeili, Jakub M. Tomczak

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Lastly, we carry out experiments on two image classification datasets and four deep neural network architectures, where we show that VSGD outperforms Adam and SGD."
Researcher Affiliation | Academia | Haotian Chen (EMAIL), Department of Mathematics and Computer Science, Eindhoven University of Technology, Netherlands; Anna Kuzina (EMAIL), Department of Computer Science, Vrije Universiteit Amsterdam, Netherlands; Babak Esmaeili (EMAIL), Department of Mathematics and Computer Science, Eindhoven University of Technology, Netherlands; Jakub M. Tomczak (EMAIL), Department of Mathematics and Computer Science, Eindhoven University of Technology, Netherlands.
Pseudocode | Yes | Algorithm 1 (VSGD):
  Input: SVI learning-rate parameters {κ1, κ2}, learning rate η, prior strength γ, prior variance ratio K_g
  Initialize: θ_0; a_{0,g} = γ; a_{0,ĝ} = γ; b_{0,g} = γ; b_{0,ĝ} = K_g·γ; μ_{0,g} = 0
  for t = 1 to T do
      Compute ĝ_t for L(θ; ·)
      ρ_{t,1} = t^{−κ1};  ρ_{t,2} = t^{−κ2}
      Update σ²_{t,g}, μ_{t,g}  (Eqs. 14, 15)
      Update a_{t,g}, a_{t,ĝ}  (Eq. 16)
      Update b_{t,g}, b_{t,ĝ}  (Eqs. 19, 20)
      Update θ_t  (Eq. 23)
  end for
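The loop structure of Algorithm 1 can be sketched in code. The helper below is an illustrative stand-in, not the paper's method: the closed-form posterior updates inside `vsgd_step` are simplified Gaussian/Gamma-style updates that only mimic the shape of Eqs. 14-23 (which are not reproduced in this report), and the hyperparameter values are assumed for the demo.

```python
import numpy as np

def vsgd_init(theta, gamma=1e-8, K_g=30.0):
    """State for the sketch: posterior mean of the true gradient and
    Gamma parameters (shape a, rate b) for the two precision variables."""
    return (np.zeros_like(theta),               # mu_g: posterior mean of g
            gamma, gamma,                       # a_g, a_ghat (Gamma shapes)
            np.full_like(theta, gamma),         # b_g (Gamma rates)
            np.full_like(theta, K_g * gamma))   # b_ghat

def vsgd_step(theta, g_hat, state, t,
              eta=0.1, kappa1=0.81, kappa2=0.9, gamma=1e-8):
    """One VSGD-style update; the update formulas are simplified stand-ins
    for the paper's Eqs. 14-23, kept only to show the algorithm's skeleton."""
    mu_g, a_g, a_gh, b_g, b_gh = state
    rho1, rho2 = t ** -kappa1, t ** -kappa2         # SVI step sizes
    lam_g, lam_gh = a_g / b_g, a_gh / b_gh          # expected precisions
    var_g = 1.0 / (lam_g + lam_gh)                  # posterior variance of g
    mu_new = var_g * (lam_g * mu_g + lam_gh * g_hat)  # precision-weighted mean
    mu_g = (1.0 - rho1) * mu_g + rho1 * mu_new      # smoothed mean update
    a_g = a_gh = gamma + 0.5                        # conjugate Gamma shapes
    b_g = (1.0 - rho2) * b_g + rho2 * (gamma + 0.5 * (mu_g ** 2 + var_g))
    b_gh = (1.0 - rho2) * b_gh + rho2 * (gamma + 0.5 * ((g_hat - mu_g) ** 2 + var_g))
    theta = theta - eta * mu_g                      # descend along posterior mean
    return theta, (mu_g, a_g, a_gh, b_g, b_gh)

# Usage: minimize f(theta) = 0.5 * ||theta||^2 from noisy gradient estimates.
rng = np.random.default_rng(0)
theta = np.array([5.0, -3.0])
state = vsgd_init(theta)
for t in range(1, 201):
    g_hat = theta + 0.1 * rng.standard_normal(theta.shape)
    theta, state = vsgd_step(theta, g_hat, state, t)
```

The Kalman-like step (precision-weighted averaging of the previous mean and the noisy gradient) is what distinguishes this family of optimizers from plain momentum: the effective smoothing adapts to the estimated gradient noise instead of using a fixed coefficient.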
Open Source Code | Yes | Code is available at github.com/generativeai-tue/vsgd
Open Datasets | Yes | "Data: We used three benchmark datasets: CIFAR100 (Krizhevsky et al., 2009), Tiny Imagenet-200 (Deng et al., 2009a), and Imagenet-1k (Deng et al., 2009b)."
Dataset Splits | Yes | "The CIFAR100 dataset contains 60000 small (32×32) RGB images labeled into 100 different classes; 50000 images are used for training, and 10000 are left for testing. In the case of Tiny Imagenet-200, the models are trained on 100000 images from 200 different classes and tested on 10000 images."
Hardware Specification | Yes | "Table 2: Average training time on GeForce RTX 2080 Ti (seconds per training iteration) on CIFAR100 dataset."
Software Dependencies | No | The paper mentions open-source implementations for VGG, ConvMixer, and ResNeXt, e.g., 'github.com/alecwangcq/KFAC-Pytorch/blob/master/models/cifar/vgg.py', but does not provide specific version numbers for software libraries such as PyTorch or other dependencies.
Experiment Setup | Yes | "Hyperparameters: We conducted a grid search over the following hyperparameters: learning rate (all optimizers); weight decay (AdamW, VSGD); momentum coefficient (SGD). For each set of hyperparameters, we trained the models with three different random seeds and chose the best one based on the validation dataset. The complete set of hyperparameters used in all experiments is reported in Table 3. Furthermore, we apply a learning rate scheduler, which halves the learning rate every 10000 training iterations for CIFAR100 and every 20000 iterations for Tiny Imagenet-200. We train VGG and ConvMixer using batch size 256 for CIFAR100 and batch size 128 for Tiny Imagenet-200. We use a smaller batch size (128 for CIFAR100 and 64 for Tiny Imagenet-200) with the ResNeXt architecture to fit training on a single GPU."
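The halving schedule quoted above is a plain step decay and can be written as a one-line function. This is a hypothetical sketch of the described behavior (the paper does not name a specific scheduler implementation); `scheduled_lr` and its `every` parameter are names introduced here for illustration.

```python
def scheduled_lr(base_lr: float, iteration: int, every: int = 10000) -> float:
    """Step decay: halve the learning rate after every `every` training
    iterations (10000 for CIFAR100, 20000 for Tiny Imagenet-200)."""
    return base_lr * 0.5 ** (iteration // every)

# CIFAR100 schedule: halve every 10000 iterations.
print(scheduled_lr(0.1, 0))                    # -> 0.1
print(scheduled_lr(0.1, 10000))                # -> 0.05
# Tiny Imagenet-200 schedule: halve every 20000 iterations.
print(scheduled_lr(0.1, 20000, every=20000))   # -> 0.05
```

In a PyTorch training loop the same behavior could be obtained with `torch.optim.lr_scheduler.StepLR(optimizer, step_size=10000, gamma=0.5)` stepped once per iteration, though the paper does not state which mechanism was used.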