Robust Learning Rate Selection for Stochastic Optimization via Splitting Diagnostic

Authors: Matteo Sordello, Niccolò Dalmasso, Hangfeng He, Weijie J. Su

TMLR 2024

Reproducibility Variable Result LLM Response
Research Type Experimental Through a series of extensive experiments, we show that this method is appropriate for both convex problems and training (non-convex) neural networks, with performance comparing favorably to other stochastic optimization methods. Importantly, this method is observed to be very robust with a set of default parameters for a wide range of problems and, moreover, can yield better generalization performance than other adaptive gradient methods such as Adam. Section 4: Experiments
Researcher Affiliation Academia Matteo Sordello (EMAIL), Department of Statistics and Data Science, Wharton School, University of Pennsylvania, Philadelphia, PA, USA; Niccolò Dalmasso (EMAIL), Department of Statistics & Data Science, Carnegie Mellon University, Pittsburgh, PA, USA; Hangfeng He (EMAIL), Department of Computer and Information Science, University of Pennsylvania, Philadelphia, PA, USA; Weijie Su (EMAIL), Department of Statistics and Data Science, Wharton School, University of Pennsylvania, Philadelphia, PA, USA
Pseudocode Yes Algorithm 1 Diagnostic(η, w, l, q, θ_in); Algorithm 2 SplitSGD(η, w, l, q, B, t1, θ0, γ)
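The splitting diagnostic named in Algorithm 1 can be sketched as follows. This is a minimal NumPy sketch based on the parameter list above, not the authors' code: `grad_fn` is a hypothetical stochastic-gradient oracle, and the two-thread/windowed-sign logic is an illustrative reading of Diagnostic(η, w, l, q, θ_in).

```python
import numpy as np


def splitting_diagnostic(theta, eta, grad_fn, w=20, l=50, q=0.4, rng=None):
    """Sketch of Diagnostic(eta, w, l, q, theta_in): run two independent SGD
    threads from the same point and, for each of w windows of l steps, record
    the sign of the inner product of the two window displacements. A fraction
    of negative signs above q is read as evidence of stationarity."""
    if rng is None:
        rng = np.random.default_rng()
    t1, t2 = theta.copy(), theta.copy()
    negatives = 0
    for _ in range(w):
        s1, s2 = t1.copy(), t2.copy()
        for _ in range(l):
            # independent stochastic gradients for the two threads
            t1 = t1 - eta * grad_fn(t1, rng)
            t2 = t2 - eta * grad_fn(t2, rng)
        if np.dot(t1 - s1, t2 - s2) < 0:
            negatives += 1
    stationary = negatives / w > q
    # return the averaged threads as the next iterate, plus the verdict
    return (t1 + t2) / 2, stationary
```

In SplitSGD, a `stationary=True` verdict would trigger multiplying the learning rate by the decay factor γ (default 0.5 per the setup below); far from a stationary point, both threads descend in roughly the same direction, so the inner products stay positive and the rate is kept.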
Open Source Code No The paper does not provide an explicit statement about the release of source code for the described methodology, nor does it include a link to a code repository.
Open Datasets Yes Convolutional neural networks (CNNs). We consider a CNN with two convolutional layers and a final linear layer trained on the Fashion-MNIST dataset (Xiao et al., 2017). Residual neural networks (ResNets) on CIFAR-10. We consider an 18-layer ResNet and evaluate it on the CIFAR-10 dataset (Krizhevsky et al., 2009). Residual neural networks (ResNets) on CIFAR-100. To show the performance of SplitSGD on a more complex classification task, we have also evaluated the 18-layer ResNet on the CIFAR-100 dataset. Recurrent neural networks (RNNs). For RNNs, we evaluate a two-layer LSTM (Hochreiter & Schmidhuber, 1997) model on the Penn Treebank (Marcus et al., 1993) language modelling task.
Dataset Splits No The paper mentions several datasets (Fashion-MNIST, CIFAR-10, CIFAR-100, Penn Treebank) and specifies batch sizes (e.g., "The batch size is 64 across all models.", "batch size to 128"). However, it does not explicitly provide information on how these datasets were split into training, validation, or test sets (e.g., specific percentages, sample counts, or citations to predefined splits), which is necessary for reproduction.
Hardware Specification No The paper does not specify any particular hardware used for running the experiments, such as GPU models, CPU types, or other detailed computing resources.
Software Dependencies No The paper mentions various algorithms and methods (e.g., SGD with momentum, Adam, FDR), but it does not specify any software dependencies with version numbers (e.g., Python 3.x, PyTorch 1.x, TensorFlow 2.x).
Experiment Setup Yes We set η ∈ {1e-2, 3e-2, 1e-1} for SGD and SplitSGD, η ∈ {1e-2, 1e-1} for FDR, and η ∈ {3e-4, 1e-3, 3e-3, 1e-2} for Adam. The batch size is 64 across all models. We use the initial learning rates η ∈ {1e-3, 1e-2, 1e-1} for SGD and SplitSGD, η ∈ {1e-2, 1e-1} for FDR, and η ∈ {3e-5, 3e-4, 3e-3} for Adam, and also consider the SGD procedure with manual decay, which consists of setting η = 1e-1 and then decreasing it by a factor of 10 at epochs 150 and 250. For consistency, we set the batch size to 128 across all models. For RNNs, we use η ∈ {0.1, 0.3, 1.0} for both SGD and SplitSGD, η ∈ {0.1, 0.3} for FDR, η ∈ {1e-4, 3e-4, 1e-3} for Adam, and also introduce SplitAdam... Here we set the batch size to 20 across all models. The key parameters are t1 = 4, w = 20, l = 50 and q = 0.4. We set the decay rate to the standard value γ = 0.5... we set the relevant hyperparameters w and q to take values w = 4 and q = 0.25.
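For reference, the learning-rate grids, batch sizes, and SplitSGD defaults quoted above can be collected into a single configuration sketch. The dictionary layout and key names are hypothetical (introduced here for illustration, not taken from the paper); only the numeric values come from the quoted setup.

```python
# Hypothetical config collecting the hyperparameter grids quoted above.
EXPERIMENTS = {
    "cnn_fashion_mnist": {
        "batch_size": 64,
        "eta": {
            "sgd": [1e-2, 3e-2, 1e-1],
            "splitsgd": [1e-2, 3e-2, 1e-1],
            "fdr": [1e-2, 1e-1],
            "adam": [3e-4, 1e-3, 3e-3, 1e-2],
        },
    },
    "resnet18_cifar": {
        "batch_size": 128,
        "eta": {
            "sgd": [1e-3, 1e-2, 1e-1],
            "splitsgd": [1e-3, 1e-2, 1e-1],
            "fdr": [1e-2, 1e-1],
            "adam": [3e-5, 3e-4, 3e-3],
        },
        # manual-decay baseline: eta = 1e-1, divided by 10 at epochs 150 and 250
        "manual_decay": {"eta0": 1e-1, "factor": 10, "milestones": [150, 250]},
    },
    "lstm_ptb": {
        "batch_size": 20,
        "eta": {
            "sgd": [0.1, 0.3, 1.0],
            "splitsgd": [0.1, 0.3, 1.0],
            "fdr": [0.1, 0.3],
            "adam": [1e-4, 3e-4, 1e-3],
        },
    },
    # SplitSGD defaults: t1, window count w, window length l, threshold q, decay gamma
    "splitsgd_defaults": {"t1": 4, "w": 20, "l": 50, "q": 0.4, "gamma": 0.5},
}
```

A config like this makes the reproducibility gap concrete: everything above is stated in the paper, while splits, hardware, and software versions (flagged "No" in the rows above) would have to be added before a faithful rerun.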