Robust Learning Rate Selection for Stochastic Optimization via Splitting Diagnostic

Authors: Matteo Sordello, Niccolò Dalmasso, Hangfeng He, Weijie J. Su

TMLR 2024

Reproducibility Variable Result LLM Response
Research Type Experimental Through a series of extensive experiments, we show that this method is appropriate for both convex problems and training (non-convex) neural networks, with performance comparing favorably to other stochastic optimization methods. Importantly, this method is observed to be very robust with a set of default parameters for a wide range of problems and, moreover, can yield better generalization performance than other adaptive gradient methods such as Adam. Section 4: Experiments
Researcher Affiliation Academia Matteo Sordello (EMAIL), Department of Statistics and Data Science, Wharton School, University of Pennsylvania, Philadelphia, PA, USA; Niccolò Dalmasso (EMAIL), Department of Statistics & Data Science, Carnegie Mellon University, Pittsburgh, PA, USA; Hangfeng He (EMAIL), Department of Computer and Information Science, University of Pennsylvania, Philadelphia, PA, USA; Weijie Su (EMAIL), Department of Statistics and Data Science, Wharton School, University of Pennsylvania, Philadelphia, PA, USA
Pseudocode Yes Algorithm 1 Diagnostic(η, w, l, q, θ_in); Algorithm 2 SplitSGD(η, w, l, q, B, t1, θ0, γ)
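The splitting diagnostic named in Algorithm 1 can be sketched as follows. This is a minimal NumPy sketch based on the parameter list above, not the authors' code: `grad_fn` is a hypothetical stochastic-gradient oracle, and the two-thread/windowed-sign logic is an illustrative reading of Diagnostic(η, w, l, q, θ_in).

```python
import numpy as np


def splitting_diagnostic(theta, eta, grad_fn, w=20, l=50, q=0.4, rng=None):
    """Sketch of Diagnostic(eta, w, l, q, theta_in): run two independent SGD
    threads from the same point and, for each of w windows of l steps, record
    the sign of the inner product of the two window displacements. A fraction
    of negative signs above q is read as evidence of stationarity."""
    if rng is None:
        rng = np.random.default_rng()
    t1, t2 = theta.copy(), theta.copy()
    negatives = 0
    for _ in range(w):
        s1, s2 = t1.copy(), t2.copy()
        for _ in range(l):
            # independent stochastic gradients for the two threads
            t1 = t1 - eta * grad_fn(t1, rng)
            t2 = t2 - eta * grad_fn(t2, rng)
        if np.dot(t1 - s1, t2 - s2) < 0:
            negatives += 1
    stationary = negatives / w > q
    # return the averaged threads as the next iterate, plus the verdict
    return (t1 + t2) / 2, stationary
```

In SplitSGD, a `stationary=True` verdict would trigger multiplying the learning rate by the decay factor γ (default 0.5 per the setup below); far from a stationary point, both threads descend in roughly the same direction, so the inner products stay positive and the rate is kept.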
Open Source Code No The paper does not provide an explicit statement about the release of source code for the described methodology, nor does it include a link to a code repository.
Open Datasets Yes Convolutional neural networks (CNNs). We consider a CNN with two convolutional layers and a final linear layer trained on the Fashion-MNIST dataset (Xiao et al., 2017). Residual neural networks (ResNets) on CIFAR-10. We consider an 18-layer ResNet and evaluate it on the CIFAR-10 dataset (Krizhevsky et al., 2009). Residual neural networks (ResNets) on CIFAR-100. To show the performance of SplitSGD on a more complex classification task, we have also evaluated the 18-layer ResNet on the CIFAR-100 dataset. Recurrent neural networks (RNNs). For RNNs, we evaluate a two-layer LSTM (Hochreiter & Schmidhuber, 1997) model on the Penn Treebank (Marcus et al., 1993) language modelling task.
Dataset Splits No The paper mentions several datasets (Fashion-MNIST, CIFAR-10, CIFAR-100, Penn Treebank) and specifies batch sizes (e.g., "The batch size is 64 across all models.", "batch size to 128"). However, it does not explicitly provide information on how these datasets were split into training, validation, or test sets (e.g., specific percentages, sample counts, or citations to predefined splits), which is necessary for reproduction.
Hardware Specification No The paper does not specify any particular hardware used for running the experiments, such as GPU models, CPU types, or other detailed computing resources.
Software Dependencies No The paper mentions various algorithms and methods (e.g., SGD with momentum, Adam, FDR), but it does not specify any software dependencies with version numbers (e.g., Python 3.x, PyTorch 1.x, TensorFlow 2.x).
Experiment Setup Yes We set η ∈ {1e-2, 3e-2, 1e-1} for SGD and SplitSGD, η ∈ {1e-2, 1e-1} for FDR, and η ∈ {3e-4, 1e-3, 3e-3, 1e-2} for Adam. The batch size is 64 across all models. We use the initial learning rates η ∈ {1e-3, 1e-2, 1e-1} for SGD and SplitSGD, η ∈ {1e-2, 1e-1} for FDR, and η ∈ {3e-5, 3e-4, 3e-3} for Adam, and also consider the SGD procedure with manual decay, which consists of setting η = 1e-1 and then decreasing it by a factor of 10 at epochs 150 and 250. For consistency, we set the batch size to 128 across all models. For RNNs, we use η ∈ {0.1, 0.3, 1.0} for both SGD and SplitSGD, η ∈ {0.1, 0.3} for FDR, η ∈ {1e-4, 3e-4, 1e-3} for Adam, and also introduce SplitAdam... Here we set the batch size to 20 across all models. The key parameters are t1 = 4, w = 20, l = 50 and q = 0.4. We set the decay rate to the standard value γ = 0.5... we set the relevant hyperparameters w and q to take values w = 4 and q = 0.25.
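For reference, the learning-rate grids, batch sizes, and SplitSGD defaults quoted above can be collected into a single configuration sketch. The dictionary layout and key names are hypothetical (introduced here for illustration, not taken from the paper); only the numeric values come from the quoted setup.

```python
# Hypothetical config collecting the hyperparameter grids quoted above.
EXPERIMENTS = {
    "cnn_fashion_mnist": {
        "batch_size": 64,
        "eta": {
            "sgd": [1e-2, 3e-2, 1e-1],
            "splitsgd": [1e-2, 3e-2, 1e-1],
            "fdr": [1e-2, 1e-1],
            "adam": [3e-4, 1e-3, 3e-3, 1e-2],
        },
    },
    "resnet18_cifar": {
        "batch_size": 128,
        "eta": {
            "sgd": [1e-3, 1e-2, 1e-1],
            "splitsgd": [1e-3, 1e-2, 1e-1],
            "fdr": [1e-2, 1e-1],
            "adam": [3e-5, 3e-4, 3e-3],
        },
        # manual-decay baseline: eta = 1e-1, divided by 10 at epochs 150 and 250
        "manual_decay": {"eta0": 1e-1, "factor": 10, "milestones": [150, 250]},
    },
    "lstm_ptb": {
        "batch_size": 20,
        "eta": {
            "sgd": [0.1, 0.3, 1.0],
            "splitsgd": [0.1, 0.3, 1.0],
            "fdr": [0.1, 0.3],
            "adam": [1e-4, 3e-4, 1e-3],
        },
    },
    # SplitSGD defaults: t1, window count w, window length l, threshold q, decay gamma
    "splitsgd_defaults": {"t1": 4, "w": 20, "l": 50, "q": 0.4, "gamma": 0.5},
}
```

A config like this makes the reproducibility gap concrete: everything above is stated in the paper, while splits, hardware, and software versions (flagged "No" in the rows above) would have to be added before a faithful rerun.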