Implicit Regularization of AdaDelta

Authors: Matthias Englert, Ranko Lazić, Avi Semler

TMLR 2024

Reproducibility assessment: each variable, the assessed result, and the supporting response quoted from the paper.
Research Type: Experimental
Quote: "Finally, we corroborate our theoretical results by numerical experiments on convolutional networks with MNIST and CIFAR-10 datasets."
Researcher Affiliation: Academia
Quote: "Matthias Englert EMAIL University of Warwick; Ranko Lazić EMAIL University of Warwick; Avi Semler EMAIL University of Warwick"
Pseudocode: Yes
Quote: "Procedure 1 Discrete generalized AdaDelta."
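The paper's generalized procedure is not reproduced here, but the standard AdaDelta update (Zeiler, 2012) that it generalizes can be sketched in a few lines of plain Python. The scalar setting, the default values `rho = 0.9` and `eps = 1e-6`, and the toy quadratic objective are assumptions for illustration, not details taken from the paper.

```python
# Minimal sketch of one AdaDelta step (Zeiler, 2012) for a single scalar
# parameter. The defaults rho=0.9, eps=1e-6 and the toy objective are
# illustrative assumptions, not the paper's generalized procedure.
from math import sqrt

def adadelta_step(x, grad, sq_grad_avg, sq_delta_avg, lr=1.0, rho=0.9, eps=1e-6):
    """Return the updated parameter and the two running averages."""
    # Exponentially decaying average of squared gradients.
    sq_grad_avg = rho * sq_grad_avg + (1 - rho) * grad ** 2
    # Scale the gradient by RMS(previous updates) / RMS(gradients).
    delta = -sqrt(sq_delta_avg + eps) / sqrt(sq_grad_avg + eps) * grad
    # Exponentially decaying average of squared updates, used next step.
    sq_delta_avg = rho * sq_delta_avg + (1 - rho) * delta ** 2
    return x + lr * delta, sq_grad_avg, sq_delta_avg

# Example: a few steps on f(x) = x^2 (gradient 2x) move x toward 0.
x, g2, d2 = 1.0, 0.0, 0.0
for _ in range(100):
    x, g2, d2 = adadelta_step(x, 2 * x, g2, d2)
```

In PyTorch the same update is provided by `torch.optim.Adadelta`, whose `lr`, `rho`, and `eps` parameters correspond to the values above.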
Open Source Code: No
Quote: "AdaDelta is one of the main adaptive optimization algorithms, implemented in PyTorch [3], and known to perform well in many circumstances compared to other algorithms including RMSProp and Adam (cf. e.g. Ruder (2016)). Specifically: ... [3] https://pytorch.org/docs/stable/generated/torch.optim.Adadelta.html ... SGD: Stochastic gradient descent as implemented in PyTorch [5]. ... [5] https://pytorch.org/docs/stable/generated/torch.optim.SGD.html"
Explanation: The paper refers to the standard PyTorch implementations of AdaDelta and SGD and links to their official documentation; it does not state that the authors' own experimental code is openly available.
Open Datasets: Yes
Quote: "Finally, we corroborate our theoretical results by numerical experiments on convolutional networks with MNIST and CIFAR-10 datasets. ... We trained the network on MNIST (LeCun, Bottou, Bengio, and Haffner, 1998) from the default PyTorch random initialization for 500 epochs using the cross-entropy loss ... We trained the network on CIFAR-10 (Krizhevsky, 2009) from the default PyTorch random initialization for 1000 epochs using the cross-entropy loss"
Dataset Splits: No
Quote: "We trained the network on MNIST (LeCun, Bottou, Bengio, and Haffner, 1998) from the default PyTorch random initialization for 500 epochs using the cross-entropy loss: in a finer regime with batch size 100, learning rate 0.01 for SGD, and learning rate 0.1 for the four variants of AdaDelta; and in a coarser regime with batch size 1000, learning rate 0.1 for SGD, and learning rate 1 for the four variants of AdaDelta. ... We trained the network on CIFAR-10 (Krizhevsky, 2009) from the default PyTorch random initialization for 1000 epochs using the cross-entropy loss: in a finer regime with batch size 100, and learning rate 0.1 for all five algorithms; and in a coarser regime with batch size 250, and learning rate 0.25 for all five algorithms."
Explanation: The paper uses the standard MNIST and CIFAR-10 datasets but does not explicitly state the training, validation, or test splits, nor does it say that the standard splits were used.
Hardware Specification: No
Quote: "The total compute for the perceptron setting was around 10min on a mid-range CPU; whereas one run of all five algorithms for the smaller convolutional setting took roughly 2h, and for the larger convolutional setting took roughly 12h, in both cases on a mid-range GPU."
Explanation: The paper describes the hardware only as a 'mid-range CPU' and a 'mid-range GPU', without naming specific models or specifications.
Software Dependencies: No
Quote: "AdaDelta is one of the main adaptive optimization algorithms, implemented in PyTorch [3], ... [3] https://pytorch.org/docs/stable/generated/torch.optim.Adadelta.html ... SGD: Stochastic gradient descent as implemented in PyTorch [5]. ... [5] https://pytorch.org/docs/stable/generated/torch.optim.SGD.html"
Explanation: The paper names PyTorch as the software used but gives no version numbers for PyTorch or any other key libraries.
Experiment Setup: Yes
Quote: "We trained the network for K = 5000 full-batch epochs using the exponential loss, with learning rate 0.1 for SGD and learning rate 1 for the four variants of AdaDelta. ... in a finer regime with batch size 100, learning rate 0.01 for SGD, and learning rate 0.1 for the four variants of AdaDelta; and in a coarser regime with batch size 1000, learning rate 0.1 for SGD, and learning rate 1 for the four variants of AdaDelta. ... in a finer regime with batch size 100, and learning rate 0.1 for all five algorithms; and in a coarser regime with batch size 250, and learning rate 0.25 for all five algorithms."
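Collected in one place, the quoted hyper-parameters for the three settings look as follows. Only the numeric values come from the paper; the dictionary layout and key names are an illustrative assumption.

```python
# Hyper-parameters of the reported training regimes. Only the numbers are
# taken from the paper; the key names and grouping are illustrative.
experiments = {
    # Perceptron setting: full-batch, exponential loss, K = 5000 epochs.
    "perceptron": {"epochs": 5000, "lr_sgd": 0.1, "lr_adadelta": 1.0},
    # MNIST, cross-entropy loss, 500 epochs.
    "mnist_fine": {"batch_size": 100, "lr_sgd": 0.01, "lr_adadelta": 0.1},
    "mnist_coarse": {"batch_size": 1000, "lr_sgd": 0.1, "lr_adadelta": 1.0},
    # CIFAR-10, cross-entropy loss, 1000 epochs, one rate for all five algorithms.
    "cifar10_fine": {"batch_size": 100, "lr_all": 0.1},
    "cifar10_coarse": {"batch_size": 250, "lr_all": 0.25},
}

# Where the SGD and AdaDelta rates differ, the AdaDelta rate is 10x the SGD rate.
ratios = {
    name: cfg["lr_adadelta"] / cfg["lr_sgd"]
    for name, cfg in experiments.items()
    if "lr_sgd" in cfg
}
```

The consistent 10x gap between the AdaDelta and SGD learning rates (where the two differ) is the one pattern the quoted setup makes explicit.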