Adaptive Gradient Normalization and Independent Sampling for (Stochastic) Generalized-Smooth Optimization

Authors: Yufeng Yang, Erin E. Tripp, Yifan Sun, Shaofeng Zou, Yi Zhou

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on large-scale nonconvex generalized-smooth problems demonstrate the fast convergence of our algorithm. We compare the numerical performance of our IAN-SGD algorithm with other state-of-the-art stochastic algorithms in applications of nonconvex phase retrieval, distributionally robust optimization, and training deep neural networks, all of which are generalized-smooth nonconvex problems.
Researcher Affiliation | Academia | Yufeng Yang (EMAIL), Department of Computer Science and Engineering, Texas A&M University, College Station, TX 77843, USA; Erin E. Tripp (EMAIL), Mathematics and Statistics Department, Hamilton College, Clinton, NY 13323, USA; Yifan Sun (EMAIL), Department of Computer Science, Stony Brook University, Stony Brook, NY 11794, USA; Shaofeng Zou (EMAIL), School of Electrical, Computer and Energy Engineering, Arizona State University, Tempe, AZ 85287, USA; Yi Zhou (EMAIL), Department of Computer Science and Engineering, Texas A&M University, College Station, TX 77843, USA
Pseudocode | No | The paper describes the IAN-SGD algorithm only as a mathematical formula: '(IAN-SGD): w_{t+1} = w_t - γ ∇f_ξ(w_t) / h_t, where h_t = max{1, Γ / (A ||∇f_{ξ'}(w_t)|| + δ)}.' This is a mathematical definition rather than a structured pseudocode or algorithm block with numbered steps.
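The quoted update rule can be sketched in a few lines. This is an illustrative reading of the formula as quoted, not the authors' implementation; the constant A is not specified in the excerpt and is a placeholder here.

```python
import numpy as np

def ian_sgd_step(w, grad_xi, grad_xi_prime,
                 gamma=0.11, Gamma=25.0, A=1.0, delta=0.1):
    """One IAN-SGD update as quoted above.

    grad_xi:       stochastic gradient on batch xi (used in the step)
    grad_xi_prime: stochastic gradient on the independent batch xi'
                   (used only to build the normalizer h_t)
    A is a placeholder value, not stated in the excerpt.
    """
    # h_t = max{1, Gamma / (A * ||grad_{xi'}|| + delta)}
    h = max(1.0, Gamma / (A * np.linalg.norm(grad_xi_prime) + delta))
    # w_{t+1} = w_t - gamma * grad_xi / h_t
    return w - gamma * grad_xi / h
```

The point the report highlights is that the normalizer h_t is built from the independent batch ξ' only, decoupling it from the gradient used in the step.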
Open Source Code | Yes | Code available at github.com/ynyang94/Gensmooth-IAN-SGD
Open Datasets | Yes | In this experiment, we use the Life Expectancy data (Arshi, 2017) for a regression task. ... we train ResNet18 and ResNet50 (He et al., 2016) on the CIFAR10 dataset (Krizhevsky, 2009) from scratch.
Dataset Splits | No | The paper mentions only batch sizes and epoch counts, e.g., 'We set batch size |B| = 64, and for IAN-SGD, we choose a small independent batch size |B'| = 4.', 'For batch size, all algorithms use B = 128, and B' = 32 for IAN-SGD.', and 'We trained ResNet18, ResNet50 on CIFAR10 dataset for 60, 80 epochs.' While standard datasets like CIFAR10 have predefined splits, the paper does not explicitly state the train/validation/test splits used for any experiment or cite a specific split methodology.
Hardware Specification | Yes | All experiments were conducted on a PC equipped with a 24-core CPU, 32GB of RAM, and a single NVIDIA RTX4090 GPU, running Python 3.8.
Software Dependencies | No | The paper states 'running Python 3.8' and 'For SGD, Adam and Adagrad, we utilize PyTorch built-in optimizer (Paszke et al., 2019) to implement training pipelines.' While Python 3.8 pins the language version, the paper does not list version numbers for other key software components such as PyTorch.
Experiment Setup | Yes | For the stochastic momentum moving-average parameter used in the acceleration methods, we set it to 0.1 for NSGD with momentum and 0.25 for SPIDER. For stochastic algorithms that do not use multiple mini-batches, i.e., SGD, NSGD, NSGD with momentum and clipped SGD, we set the batch size to |B| = 128. For SPIDER, we set |B| = 128 and |B'| = 2313, and the algorithm performs a full-gradient computation after every 15 iterations. For IAN-SGD, we set the two batch sizes to |B| = 128 and |B'| = 8. We fine-tuned the learning rate for each algorithm: γ = 4e-5 for SGD, γ = 5e-3 for NSGD, NSGD with momentum and SPIDER, and γ = 0.11 for clipped SGD and IAN-SGD. We set δ = 1e-1, the maximal gradient clipping constants to 30 and 25 for clipped SGD and IAN-SGD respectively, and the normalization parameter to β = 2/3.
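The hyperparameters quoted above can be collected into a single configuration table. This is an illustrative summary only; the key names are made up here and do not come from the authors' code.

```python
# Reported hyperparameters for this experiment (key names are illustrative).
BETA = 2 / 3    # normalization parameter beta
DELTA = 1e-1    # delta in the normalizer

CONFIGS = {
    "SGD":           {"lr": 4e-5, "batch": 128},
    "NSGD":          {"lr": 5e-3, "batch": 128},
    "NSGD-momentum": {"lr": 5e-3, "batch": 128, "momentum": 0.1},
    "SPIDER":        {"lr": 5e-3, "batch": 128, "large_batch": 2313,
                      "full_grad_every": 15, "momentum": 0.25},
    "clipped-SGD":   {"lr": 0.11, "batch": 128, "clip": 30},
    "IAN-SGD":       {"lr": 0.11, "batch": 128, "indep_batch": 8,
                      "clip": 25},
}
```

Laid out this way, the row-by-row comparison the report quotes (shared batch size 128, per-algorithm learning rates, and the small independent batch for IAN-SGD) is easy to check at a glance.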