Simple and Optimal Stochastic Gradient Methods for Nonsmooth Nonconvex Optimization

Authors: Zhize Li, Jian Li

JMLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Theoretical | We propose and analyze several stochastic gradient algorithms for finding stationary points or local minima in nonconvex finite-sum and online optimization problems, possibly with a nonsmooth regularizer. First, we propose a simple proximal stochastic gradient algorithm based on variance reduction, called Prox SVRG+. We provide a clean and tight analysis of Prox SVRG+, which shows that it outperforms deterministic proximal gradient descent (Prox GD) for a wide range of minibatch sizes, thereby solving an open problem posed in Reddi et al. (2016b). Prox SVRG+ also uses far fewer proximal oracle calls than Prox SVRG (Reddi et al., 2016b) and extends to the online setting by avoiding full gradient computations. Then, we propose an optimal algorithm, called SSRGD, based on SARAH (Nguyen et al., 2017), and show that SSRGD further improves the gradient complexity of Prox SVRG+ and achieves the optimal upper bound, matching the known lower bound of Fang et al. (2018) and Li et al. (2021). Moreover, we show that both Prox SVRG+ and SSRGD automatically adapt to local structure of the objective function, such as the Polyak-Łojasiewicz (PL) condition for nonconvex functions in the finite-sum case; i.e., we prove that both can automatically switch to faster global linear convergence without the restarts required in the prior work Prox SVRG (Reddi et al., 2016b). Finally, we focus on the more challenging problem of finding an (ϵ, δ)-local minimum rather than merely an ϵ-approximate (first-order) stationary point, which may be a bad, unstable saddle point. We show that SSRGD can find an (ϵ, δ)-local minimum simply by adding random perturbations. The resulting algorithm is almost as simple as its counterpart for finding stationary points and achieves similar optimal rates.
Researcher Affiliation | Academia | Zhize Li (EMAIL), Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA 15213, USA; Jian Li (EMAIL), Institute for Interdisciplinary Information Sciences, Tsinghua University, Beijing 100084, China
Pseudocode | Yes | Algorithm 1: Prox SVRG+; Algorithm 2: SSRGD; Algorithm 3: SSRGD (full version for finding approximate local minima)
Open Source Code | No | The paper does not contain any explicit statements about the release of source code for the methodology described, nor does it provide links to a code repository. The license mentioned ("License: CC-BY 4.0") refers to the publication license for the paper itself, not an open-source code release.
Open Datasets | No | The paper is theoretical and focuses on optimization algorithms for problems like "nonsmooth (composite) nonconvex optimization problems" and "standard empirical risk minimization problems." It does not describe experiments using specific datasets, nor does it provide any links, DOIs, or citations for publicly available datasets that were used or are being made available.
Dataset Splits | No | The paper is purely theoretical, analyzing optimization algorithms and their convergence properties. It does not perform experiments on specific datasets, and therefore no information regarding dataset splits (e.g., training, testing, validation) is provided.
Hardware Specification | No | The paper is a theoretical work focusing on the proposal and analysis of stochastic gradient algorithms. It does not report on any empirical experiments or computational evaluations that would require specific hardware, and thus no hardware specifications are mentioned.
Software Dependencies | No | The paper presents theoretical algorithms and their mathematical analysis. It does not describe any implementation details, software packages, or libraries with specific version numbers that were used in experiments.
Experiment Setup | No | This paper is theoretical in nature, focusing on the derivation and analysis of optimization algorithms rather than empirical evaluation. Consequently, it does not provide details on experimental setups, such as specific hyperparameter values, learning rates, batch sizes, or model initialization strategies.
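Since the paper itself ships no code, the two gradient estimators at the heart of the abstract (the snapshot-based SVRG correction in Prox SVRG+ and the recursive SARAH correction in SSRGD) can be illustrated with an unofficial NumPy sketch on a small Lasso instance. The function names, step size, batch size, and epoch counts below are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def prox_l1(x, t):
    """Soft-thresholding: proximal operator of t * ||.||_1 (the nonsmooth step)."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def run(A, y, lam=0.1, eta=0.05, epochs=20, inner=50, batch=4,
        estimator="svrg", seed=0):
    """Minimize (1/2n)||Ax - y||^2 + lam * ||x||_1.

    estimator="svrg"  -> snapshot-based correction, as in Prox SVRG+
    estimator="sarah" -> recursive SARAH-style correction, as in SSRGD
    All hyperparameters here are illustrative, not the paper's.
    """
    rng = np.random.default_rng(seed)
    n, d = A.shape
    grad = lambda z, idx: A[idx].T @ (A[idx] @ z - y[idx]) / len(idx)
    x = np.zeros(d)
    for _ in range(epochs):
        snap = x.copy()
        g_full = grad(snap, np.arange(n))      # full gradient at epoch start
        if estimator == "sarah":
            x_prev, v = x.copy(), g_full
            x = prox_l1(x - eta * v, eta * lam)
        for _ in range(inner):
            idx = rng.integers(0, n, size=batch)
            if estimator == "svrg":
                # minibatch gradient corrected against the fixed snapshot
                v = grad(x, idx) - grad(snap, idx) + g_full
            else:
                # recursive correction against the previous iterate
                v = grad(x, idx) - grad(x_prev, idx) + v
                x_prev = x.copy()
            x = prox_l1(x - eta * v, eta * lam)  # proximal gradient step
    return x
```

The only change between the two modes is which reference point anchors the variance-reduction term: Prox SVRG+ corrects against a fixed per-epoch snapshot, while the SARAH estimator in SSRGD corrects against the immediately preceding iterate, carrying the estimator forward recursively.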