Distributed Stochastic Gradient Descent: Nonconvexity, Nonsmoothness, and Convergence to Local Minima

Authors: Brian Swenson, Ryan Murray, H. Vincent Poor, Soummya Kar

JMLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Theoretical | "It is shown that, for each fixed initialization, D-SGD converges to critical points of the loss with probability one. Next, we consider the problem of avoiding saddle points... Results are proved by studying the underlying (distributed) gradient flow, using the ordinary differential equation (ODE) method of stochastic approximation. The remainder of the paper is organized as follows. Section 2 presents the main results, reviews related literature, and introduces notation to be used in the proofs. Sections 3-8 prove the main results (see Section 2.8 for an overview of these sections and the general proof strategy). Section 9 concludes the paper."
Researcher Affiliation | Academia | Brian Swenson (EMAIL), Applied Research Laboratory, Pennsylvania State University, State College, PA 16801; Ryan Murray (EMAIL), Department of Mathematics, North Carolina State University, Raleigh, NC 27695; H. Vincent Poor (EMAIL), Department of Electrical and Computer Engineering, Princeton University, Princeton, NJ 08544; Soummya Kar (EMAIL), Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA 15213
Pseudocode | No | The D-SGD algorithm is defined agentwise by the recursion
$$x_n(k+1) = x_n(k) - \alpha_k \big( \nabla f_n(x_n(k)) + \xi_n(k+1) \big) + \beta_k \sum_{\ell} \big( x_\ell(k) - x_n(k) \big), \quad (2)$$
for $n = 1, \ldots, N$, where the sum runs over the neighbors $\ell$ of agent $n$.
Open Source Code | No | The paper does not provide any statement or link regarding the availability of source code for the methodology described.
Open Datasets | No | "Concretely, suppose that $D_n = \{(x_i, y_i)\}_i$ represents a local data set collected or stored by agent $n$. Let $\ell(\cdot,\cdot)$ denote some predefined loss function, and let $h(\cdot, \theta)$ denote a parametric hypothesis class, with parameter $\theta$. In empirical risk minimization, the objective is to minimize the empirical risk over the data held by all agents, i.e., solve the optimization problem
$$\min_\theta \sum_{(x,y) \in \bigcup_n D_n} \ell(h(x,\theta), y) = \min_\theta \sum_{n=1}^{N} \sum_{(x,y) \in D_n} \ell(h(x,\theta), y),$$
where the objective above fits the form of (1) with $f_n(\theta) = \sum_{(x,y) \in D_n} \ell(h(x,\theta), y)$."
Dataset Splits | No | The paper does not describe any experimental evaluation using specific datasets, and therefore no information on dataset splits is provided.
Hardware Specification | No | The paper is theoretical in nature, focusing on mathematical proofs and convergence analysis of D-SGD. It does not describe any experiments or computations that would require specific hardware, hence no hardware specifications are provided.
Software Dependencies | No | The paper is purely theoretical, providing mathematical analysis and proofs. It does not describe any software implementations or experiments, so no software dependencies with version numbers are mentioned.
Experiment Setup | No | The paper focuses on theoretical analysis, theorems, and proofs for distributed stochastic gradient descent. It does not present any experimental results or describe an experimental setup with hyperparameters or training configurations.
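As a concrete illustration of the D-SGD recursion (2) quoted in the Pseudocode row, the following is a minimal NumPy sketch of one agentwise iterate. The function name `dsgd_step` and the dictionary-of-neighbor-lists representation are illustrative assumptions, not constructs from the paper.

```python
import numpy as np

def dsgd_step(x, grads, noise, neighbors, alpha_k, beta_k):
    """One agentwise D-SGD iterate in the spirit of recursion (2):
    a noisy local gradient step plus a consensus term pulling each
    agent toward its neighbors' iterates."""
    x_new = np.empty_like(x)
    for n in range(len(x)):
        # Consensus term: sum of (x_l(k) - x_n(k)) over neighbors l of agent n.
        consensus = sum(x[l] - x[n] for l in neighbors[n])
        # Noisy gradient step weighted by alpha_k; consensus weighted by beta_k.
        x_new[n] = x[n] - alpha_k * (grads[n] + noise[n]) + beta_k * consensus
    return x_new
```

With decaying step-size sequences (alpha_k, beta_k) and, say, two agents minimizing $f_n(x) = x^2/2$ (so the gradient at $x$ is $x$), repeated calls drive the agents toward consensus at the critical point $x = 0$, matching the paper's convergence statement.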
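The Open Datasets row quotes the empirical-risk decomposition: the risk over the pooled data equals the sum of the per-agent risks $f_n$. A small sketch checking that identity numerically, with an illustrative squared loss and linear hypothesis class; all function names here are assumptions made for the example, not the paper's notation.

```python
def sq_loss(pred, y):
    # Illustrative choice of the loss ell(., .): squared error.
    return (pred - y) ** 2

def h(x, theta):
    # Illustrative hypothesis class h(., theta): linear prediction.
    return theta * x

def agent_risk(D_n, theta):
    # f_n(theta) = sum over (x, y) in D_n of ell(h(x, theta), y).
    return sum(sq_loss(h(x, theta), y) for x, y in D_n)

def global_risk(datasets, theta):
    # Empirical risk over all agents' data: sum over n of f_n(theta).
    return sum(agent_risk(D_n, theta) for D_n in datasets)
```

For disjoint local data sets, the risk computed over the pooled data coincides with $\sum_n f_n(\theta)$, which is exactly the identity in the quoted passage.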