Wide-minima Density Hypothesis and the Explore-Exploit Learning Rate Schedule

Authors: Nikhil Iyer, V. Thejas, Nipun Kwatra, Ramachandran Ramjee, Muthian Sivathanu

JMLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this paper, through detailed experiments, we not only corroborate the generalization properties of wide minima but also provide empirical evidence for a new hypothesis that the density of wide minima is likely lower than the density of narrow minima. Further, motivated by this hypothesis, we design a novel explore-exploit learning rate schedule. On a variety of image and natural language datasets, compared to their original hand-tuned learning rate baselines, we show that our explore-exploit schedule can result in either up to 0.84% higher absolute accuracy using the original training budget or up to 57% reduced training time while achieving the original reported accuracy.
Researcher Affiliation | Industry | Nikhil Iyer (Microsoft Research India), V. Thejas (Atlassian India), Nipun Kwatra (Microsoft Research India), Ramachandran Ramjee (Microsoft Research India), Muthian Sivathanu (Microsoft Research India)
Pseudocode | No | The paper describes the proposed 'Knee schedule' and 'Keskar's Sharpness Metric' through textual descriptions and mathematical formulas (e.g., Equation 1) but does not include any explicitly labeled pseudocode or algorithm blocks with structured steps.
Open Source Code | Yes | Source code available at: https://github.com/nikhil-iyer-97/wide-minima-density-hypothesis
Open Datasets | Yes | On a variety of image and natural language datasets... ImageNet on ResNet-50, CIFAR-10 on ResNet-18... BERT-Large pre-training on Wikipedia+BooksCorpus and fine-tuning it on SQuAD v1.1; and WMT'14 (EN-DE), IWSLT'14 (DE-EN) on Transformers.
Dataset Splits | Yes | We start with a series of experiments training ResNet-18 on CIFAR-10 over 200 epochs... We train the ImageNet dataset (Russakovsky et al., 2015) on the ResNet-50 network (He et al., 2016)... The pre-trained models are evaluated on the SQuAD v1.1 (Rajpurkar et al., 2016) dataset by fine-tuning on the dataset for 2 epochs... In all cases we use the model checkpoint with the least loss on the validation set for computing BLEU scores on the test set.
Hardware Specification | Yes | This corresponds to significant savings in GPU compute, e.g. savings of over 1000 V100 GPU-hours for BERT-Large pre-training. ...BERT pre-training is extremely compute expensive and takes around 47 hours on 64 V100 GPUs (3008 V100 GPU-hrs) on cloud VMs.
Software Dependencies | No | The paper mentions using optimizers like SGD Momentum, Adam, RAdam, and LAMB, and refers to frameworks/implementations like fairseq, huggingface/transformers, pytorch-cifar, and imagenet18_old. However, it does not specify concrete version numbers for any of these software components.
Experiment Setup | Yes | We vary the number of epochs trained at a high learning rate of 0.1, called the explore epochs, from 0 to 100 and divide up the remaining epochs equally for training with learning rates of 0.01 and 0.001. ...batch size of 256 and a seed learning rate of 0.1. ...SGD optimizer with momentum of 0.9 and weight decay of 1e-4. ...batch size of 16384... RAdam (Liu et al., 2019) optimizer with β1 of 0.9 and β2 of 0.999. Label smoothed cross entropy was used as the objective function with an uncertainty of 0.1. A dropout of 0.1, clipping norm of 25 and weight decay of 1e-4 are used.
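The CIFAR-10 setup quoted above (a high-LR explore phase of 0 to 100 epochs, with the remaining epochs split equally between learning rates of 0.01 and 0.001 over a 200-epoch budget) can be sketched as a plain Python function. This is a minimal illustration of that step-decay experiment, not the paper's full Knee schedule; the function name and signature are hypothetical.

```python
def knee_schedule_lr(epoch, total_epochs=200, explore_epochs=100,
                     lrs=(0.1, 0.01, 0.001)):
    """Explore-exploit step schedule from the quoted CIFAR-10 experiments:
    hold the seed LR for `explore_epochs`, then split the remaining
    epochs equally between the two lower learning rates."""
    explore_lr, exploit_hi, exploit_lo = lrs
    remaining = total_epochs - explore_epochs
    if epoch < explore_epochs:
        return explore_lr                      # explore phase at 0.1
    if epoch < explore_epochs + remaining // 2:
        return exploit_hi                      # first exploit step at 0.01
    return exploit_lo                          # final exploit step at 0.001
```

With the defaults above, epochs 0-99 use 0.1, epochs 100-149 use 0.01, and epochs 150-199 use 0.001; sweeping `explore_epochs` from 0 to 100 reproduces the variation described in the quote.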