Wide-minima Density Hypothesis and the Explore-Exploit Learning Rate Schedule

Authors: Nikhil Iyer, V. Thejas, Nipun Kwatra, Ramachandran Ramjee, Muthian Sivathanu

JMLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this paper, through detailed experiments, we not only corroborate the generalization properties of wide minima but also provide empirical evidence for a new hypothesis that the density of wide minima is likely lower than the density of narrow minima. Further, motivated by this hypothesis, we design a novel explore-exploit learning rate schedule. On a variety of image and natural language datasets, compared to their original hand-tuned learning rate baselines, we show that our explore-exploit schedule can result in either up to 0.84% higher absolute accuracy using the original training budget or up to 57% reduced training time while achieving the original reported accuracy.
Researcher Affiliation | Industry | Nikhil Iyer (Microsoft Research India), V. Thejas (Atlassian India), Nipun Kwatra (Microsoft Research India), Ramachandran Ramjee (Microsoft Research India), Muthian Sivathanu (Microsoft Research India)
Pseudocode | No | The paper describes the proposed 'Knee schedule' and 'Keskar's Sharpness Metric' through textual descriptions and mathematical formulas (e.g., Equation 1) but does not include any explicitly labeled pseudocode or algorithm blocks with structured steps.
Open Source Code | Yes | Source code available at: https://github.com/nikhil-iyer-97/wide-minima-density-hypothesis
Open Datasets | Yes | On a variety of image and natural language datasets... ImageNet on ResNet-50, CIFAR-10 on ResNet-18... BERT-Large pre-training on Wikipedia+BooksCorpus and fine-tuning it on SQuAD v1.1; and WMT'14 (EN-DE), IWSLT'14 (DE-EN) on Transformers.
Dataset Splits | Yes | We start with a series of experiments training ResNet-18 on CIFAR-10 over 200 epochs... We train the ImageNet dataset (Russakovsky et al., 2015) on the ResNet-50 network (He et al., 2016)... The pre-trained models are evaluated on the SQuAD v1.1 (Rajpurkar et al., 2016) dataset by fine-tuning on the dataset for 2 epochs... In all cases we use the model checkpoint with the least loss on the validation set for computing BLEU scores on the test set.
Hardware Specification | Yes | This corresponds to significant savings in GPU compute, e.g. savings of over 1000 V100 GPU-hours for BERT-Large pre-training. ...BERT pre-training is extremely compute expensive and takes around 47 hours on 64 V100 GPUs (3008 V100 GPU-hrs) on cloud VMs.
Software Dependencies | No | The paper mentions using optimizers like SGD Momentum, Adam, RAdam, and LAMB, and refers to frameworks/implementations like fairseq, huggingface/transformers, pytorch-cifar, and imagenet18_old. However, it does not specify concrete version numbers for any of these software components.
Experiment Setup | Yes | We vary the number of epochs trained at a high learning rate of 0.1, called the explore epochs, from 0 to 100 and divide up the remaining epochs equally for training with learning rates of 0.01 and 0.001. ...batch size of 256 and a seed learning rate of 0.1. ...SGD optimizer with momentum of 0.9 and weight decay of 1e-4. ...batch size of 16384... RAdam (Liu et al., 2019) optimizer with β1 of 0.9 and β2 of 0.999. Label smoothed cross entropy was used as the objective function with an uncertainty of 0.1. A dropout of 0.1, clipping norm of 25 and weight decay of 1e-4 are used.
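The CIFAR-10 setup quoted above (a high-LR explore phase of 0 to 100 epochs, with the remaining epochs split equally between learning rates of 0.01 and 0.001 over a 200-epoch budget) can be sketched as a plain Python function. This is a minimal illustration of that step-decay experiment, not the paper's full Knee schedule; the function name and signature are hypothetical.

```python
def knee_schedule_lr(epoch, total_epochs=200, explore_epochs=100,
                     lrs=(0.1, 0.01, 0.001)):
    """Explore-exploit step schedule from the quoted CIFAR-10 experiments:
    hold the seed LR for `explore_epochs`, then split the remaining
    epochs equally between the two lower learning rates."""
    explore_lr, exploit_hi, exploit_lo = lrs
    remaining = total_epochs - explore_epochs
    if epoch < explore_epochs:
        return explore_lr                      # explore phase at 0.1
    if epoch < explore_epochs + remaining // 2:
        return exploit_hi                      # first exploit step at 0.01
    return exploit_lo                          # final exploit step at 0.001
```

With the defaults above, epochs 0-99 use 0.1, epochs 100-149 use 0.01, and epochs 150-199 use 0.001; sweeping `explore_epochs` from 0 to 100 reproduces the variation described in the quote.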