How iteration composition influences convergence and stability in deep learning

Authors: Benoit Dherin, Benny Avelin, Anders Karlsson, Hanna Mazzawi, Javier Gonzalvo, Michael Munn

TMLR 2025

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our theoretical analysis shows that in contractive regions (e.g., around minima) backward-SGD converges to a point while the standard forward-SGD generally only converges to a distribution. This leads to improved stability and convergence, which we demonstrate experimentally. Our experiments provide a proof of concept supporting this phenomenon. |
| Researcher Affiliation | Collaboration | Benoit Dherin (Google Research); Benny Avelin (Department of Mathematics, Uppsala University); Anders Karlsson (Department of Mathematics, University of Geneva and Uppsala University); Hanna Mazzawi (Google Research); Javier Gonzalvo (Google Research); Michael Munn (Google Research) |
| Pseudocode | No | The paper describes the algorithms (forward and backward SGD) conceptually and mathematically (e.g., θ_n = T_n ∘ T_{n−1} ∘ ⋯ ∘ T_1(θ)) but does not provide a distinct, structured pseudocode block or algorithm listing. |
| Open Source Code | No | We defer engineering applications leveraging this phenomenon (like efficient implementations of the backward SGD) to future work, while outlining a few potential directions at the paper's conclusion. |
| Open Datasets | Yes | We trained a ResNet-18 with stochastic gradient descent and no regularization on the CIFAR-10 dataset Krizhevsky (2009). ... MLP trained on Fashion MNIST Xiao et al. (2017). ... ResNet-50 model He et al. (2016) using both forward and backward stochastic gradient descent with no regularization on the CIFAR-100 dataset Krizhevsky (2009). |
| Dataset Splits | No | The paper mentions training on datasets like CIFAR-10, Fashion MNIST, and CIFAR-100 but does not explicitly state the training, validation, or test split percentages or sample counts used for these datasets. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory amounts used for running the experiments. |
| Software Dependencies | No | The paper mentions using AdamW as an optimizer but does not specify any software libraries (e.g., TensorFlow, PyTorch) or their version numbers, nor any other relevant software dependencies with versions. |
| Experiment Setup | Yes | We used a learning rate of 0.025 and a batch-size of 8. ... The batch-size was set to 8 while the learning rate was 0.001. ... learning rate of 0.001 and a batch-size of 8. ... learning rate of 0.001 and a batch-size of 16. ... learning rate of 0.00025 and a batch-size of 8. |
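Since the paper releases no pseudocode or code, the iteration-composition idea it studies, forward SGD θ_n = T_n ∘ ⋯ ∘ T_1(θ) versus backward SGD θ_n = T_1 ∘ ⋯ ∘ T_n(θ), can be illustrated with a minimal sketch. Everything below is a hypothetical toy construction, not the authors' setup: each batch map T_k is one SGD step on a 1-D quadratic loss, so each map is a contraction, and the alternating targets `b` stand in for batch noise.

```python
import numpy as np

# Toy batch maps: T_k(theta) = theta - lr * (theta - b_k), one SGD step on the
# quadratic f_k(theta) = 0.5 * (theta - b_k)^2. Each T_k contracts by (1 - lr).
lr = 0.1
b = np.array([1.0 if k % 2 == 0 else -1.0 for k in range(20)])  # hypothetical batch targets

def T(k, theta):
    """One SGD step on batch k."""
    return theta - lr * (theta - b[k])

def forward_sgd(theta0, n):
    """theta_n = T_n ∘ ... ∘ T_1 (theta0): standard SGD, newest map applied last."""
    theta = theta0
    for k in range(n):
        theta = T(k, theta)
    return theta

def backward_sgd(theta0, n):
    """theta_n = T_1 ∘ ... ∘ T_n (theta0): newest map applied first (innermost),
    so the composition is replayed from theta0 and the contractions of the
    earlier maps damp the newest batch's influence."""
    theta = theta0
    for k in reversed(range(n)):
        theta = T(k, theta)
    return theta

# Gap between consecutive iterates: the backward gap shrinks geometrically in n
# (convergence to a point), while the forward iterate keeps fluctuating with
# the latest batch (convergence only in distribution).
d_fwd = abs(forward_sgd(0.0, 20) - forward_sgd(0.0, 19))
d_back = abs(backward_sgd(0.0, 20) - backward_sgd(0.0, 19))
```

On this toy problem the backward gap is roughly lr · (1 − lr)^{n−1}, vanishing as n grows, which matches the paper's claim that backward-SGD converges to a point in contractive regions while forward-SGD does not.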