Nonconvex Stochastic Bregman Proximal Gradient Method with Application to Deep Learning

Authors: Kuangyu Ding, Jingyang Li, Kim-Chuan Toh

JMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results on quadratic inverse problems demonstrate SBPG's robustness in terms of stepsize selection and sensitivity to the initial point. Furthermore, we introduce a momentum-based variant, MSBPG, which enhances convergence by relaxing the mini-batch size requirement while preserving the optimal oracle complexity. We apply MSBPG to the training of deep neural networks, utilizing a polynomial kernel function to ensure smooth adaptivity of the loss function. Experimental results on benchmark datasets confirm the effectiveness and robustness of MSBPG in training neural networks.
Researcher Affiliation | Academia | Kuangyu Ding EMAIL, Department of Mathematics, National University of Singapore, 10 Lower Kent Ridge Road, Singapore 119076; Jingyang Li EMAIL, Department of Mathematics, National University of Singapore, 10 Lower Kent Ridge Road, Singapore 119076; Kim-Chuan Toh EMAIL, Department of Mathematics and Institute of Operations Research and Analytics, National University of Singapore, 10 Lower Kent Ridge Road, Singapore 119076
Pseudocode | Yes | Details of the implementation are provided in Algorithm 1: Momentum-based Stochastic Bregman Proximal Gradient (MSBPG) for training neural networks.
Open Source Code | No | The paper describes MSBPG as a promising "universal open-source optimizer for future applications" but does not explicitly state that the code for the current work is being released, nor does it provide a link.
Open Datasets | Yes | We conducted experiments on several representative benchmarks, including VGG16 (Simonyan and Zisserman, 2014) and ResNet34 (He et al., 2016) on the CIFAR10 dataset (Krizhevsky et al., 2009), ResNet34 (He et al., 2016) and DenseNet121 (Huang et al., 2017) on the CIFAR100 dataset (Krizhevsky et al., 2009), and LSTMs (Hochreiter and Schmidhuber, 1997) on the Penn Treebank dataset (Marcinkiewicz, 1994).
Dataset Splits | Yes | We used the default training hyperparameters of SGD, Adam, and AdamW in these settings (He et al., 2016; Zhuang et al., 2020; Chen et al., 2021), and set MSBPG's learning rate (initial stepsize) as 0.1, momentum coefficient β as 0.9, and weight decay coefficient λ2 as 1×10^-3. ... We followed the standard experimental setup for training LSTMs (Zhuang et al., 2020; Chen et al., 2021)...
Hardware Specification | Yes | The experiments for the quadratic inverse problem are conducted using MATLAB R2021b on a Windows workstation equipped with a 12-core Intel Xeon E5-2680 @ 2.50GHz processor and 128GB of RAM. For the deep learning experiments, we conducted the experiments using PyTorch running on a single RTX 3090 GPU.
Software Dependencies | Yes | The experiments for the quadratic inverse problem are conducted using MATLAB R2021b on a Windows workstation... For the deep learning experiments, we conducted the experiments using PyTorch running on a single RTX 3090 GPU.
Experiment Setup | Yes | For our experiments, we utilized two common training strategies: reducing the stepsize to 10% of its original value near the end of training (Zhuang et al., 2020; Chen et al., 2021; Luo et al., 2019), and using a cosine annealing schedule for stepsizes (Loshchilov and Hutter, 2016, 2017). ... For MSBPG, we set the learning rate to 25, 80, and 80 for 1-, 2-, and 3-layer LSTMs, respectively, with momentum parameter β = 0.9 and weight decay coefficient λ2 = 2×10^-6. For the layerwise kernel function φi(Wi) = (1/r)||Wi||^r, we set r = 4 and δ = 1×10^-6.
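The momentum-based Bregman proximal step quoted above can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a layerwise kernel of the form h(W) = (1/r)||W||^r + (δ/2)||W||^2 (consistent with the φi quoted in the Experiment Setup row, with a δ-weighted quadratic term added for strong convexity), an exponential-moving-average form of the momentum, and a scalar Newton solve to invert ∇h; the function name and these modeling choices are assumptions.

```python
import numpy as np

def msbpg_layer_step(w, grad, momentum, lr=0.1, beta=0.9, r=4, delta=1e-6):
    """One hypothetical MSBPG update for a single layer's weights w.

    Assumed kernel: h(W) = (1/r)||W||^r + (delta/2)||W||^2, so that
    grad_h(W) = (||W||^(r-2) + delta) * W.
    """
    # Momentum-averaged stochastic gradient (exponential moving average).
    momentum = beta * momentum + (1 - beta) * grad
    # Mirror step in the dual space: v = grad_h(w) - lr * momentum.
    v = (np.linalg.norm(w) ** (r - 2) + delta) * w - lr * momentum
    # Invert grad_h: the new iterate is parallel to v, with norm t solving
    # the scalar monotone equation delta * t + t^(r-1) = ||v||.
    s = np.linalg.norm(v)
    if s == 0.0:
        return np.zeros_like(v), momentum
    t = s  # Newton's method on f(t) = delta*t + t^(r-1) - s.
    for _ in range(50):
        t -= (delta * t + t ** (r - 1) - s) / (delta + (r - 1) * t ** (r - 2))
    return (t / s) * v, momentum
```

With r = 2 and delta = 0 the kernel reduces to the Euclidean one and the step collapses to SGD with momentum; the r = 4 polynomial kernel quoted above instead damps the step size adaptively when ||w|| is large, which is the smooth-adaptivity property the paper attributes to MSBPG.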