Distributed Stochastic Gradient Descent: Nonconvexity, Nonsmoothness, and Convergence to Local Minima
Authors: Brian Swenson, Ryan Murray, H. Vincent Poor, Soummya Kar
JMLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | It is shown that, for each fixed initialization, D-SGD converges to critical points of the loss with probability one. Next, we consider the problem of avoiding saddle points... Results are proved by studying the underlying (distributed) gradient flow, using the ordinary differential equation (ODE) method of stochastic approximation. The remainder of the paper is organized as follows. Section 2 presents the main results, reviews related literature, and introduces notation to be used in the proofs. Sections 3–8 prove the main results (see Section 2.8 for an overview of these sections and the general proof strategy). Section 9 concludes the paper. |
| Researcher Affiliation | Academia | Brian Swenson (EMAIL), Applied Research Laboratory, Pennsylvania State University, State College, PA 16801; Ryan Murray (EMAIL), Department of Mathematics, North Carolina State University, Raleigh, NC 27695; H. Vincent Poor (EMAIL), Department of Electrical and Computer Engineering, Princeton University, Princeton, NJ 08544; Soummya Kar (EMAIL), Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA 15213 |
| Pseudocode | No | The D-SGD algorithm is defined agentwise by the recursion $x_n(k+1) = x_n(k) - \alpha_k\big(\nabla f_n(x_n(k)) + \xi_n(k+1)\big) + \beta_k \sum_{\ell \in \Omega_n}\big(x_\ell(k) - x_n(k)\big)$, (2), for $n = 1, \ldots, N$ |
| Open Source Code | No | The paper does not provide any statement or link regarding the availability of source code for the methodology described. |
| Open Datasets | No | Concretely, suppose that $D_n = \{(x_i, y_i)\}_i$ represents a local data set collected or stored by agent n. Let $\ell(\cdot, \cdot)$ denote some predefined loss function, and let $h(\cdot, \theta)$ denote a parametric hypothesis class, with parameter $\theta$. In empirical risk minimization, the objective is to minimize the empirical risk over the data held by all agents, i.e., solve the optimization problem $\min_\theta \sum_{(x,y) \in \bigcup_n D_n} \ell(h(x, \theta), y) = \min_\theta \sum_n \sum_{(x,y) \in D_n} \ell(h(x, \theta), y)$, where the objective above fits the form of (1) with $f_n(\theta) = \sum_{(x,y) \in D_n} \ell(h(x, \theta), y)$. |
| Dataset Splits | No | The paper does not describe any experimental evaluation using specific datasets, and therefore, no information on dataset splits is provided. |
| Hardware Specification | No | The paper is theoretical in nature, focusing on mathematical proofs and convergence analysis of D-SGD. It does not describe any experiments or computations that would require specific hardware, hence no hardware specifications are provided. |
| Software Dependencies | No | The paper is purely theoretical, providing mathematical analysis and proofs. It does not describe any software implementations or experiments, thus no software dependencies with version numbers are mentioned. |
| Experiment Setup | No | The paper focuses on theoretical analysis, theorems, and proofs for distributed stochastic gradient descent. It does not present any experimental results or describe an experimental setup with hyperparameters or training configurations. |
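The D-SGD recursion (2) quoted in the Pseudocode row can be illustrated with a short numerical sketch. Everything below other than the update rule itself — the quadratic local losses, the ring communication graph, the step-size schedules, and the noise scale — is an illustrative assumption, not taken from the paper.

```python
import numpy as np

# Hypothetical sketch of the D-SGD update (2):
#   x_n(k+1) = x_n(k) - a_k (grad f_n(x_n(k)) + xi_n(k+1))
#              + b_k * sum over neighbors l of (x_l(k) - x_n(k))
# Local losses, graph, and step sizes below are assumed for illustration.

rng = np.random.default_rng(0)

N, d = 4, 2                        # number of agents, parameter dimension
targets = rng.normal(size=(N, d))  # minimizer of each agent's local loss

def grad_fn(n, x):
    # Gradient of the assumed local loss f_n(x) = 0.5 * ||x - targets[n]||^2.
    return x - targets[n]

# Ring communication graph: each agent talks to its two neighbors.
neighbors = {n: [(n - 1) % N, (n + 1) % N] for n in range(N)}

x = rng.normal(size=(N, d))        # independent initializations
for k in range(1, 5001):
    alpha_k = 1.0 / k              # diminishing gradient step size
    beta_k = 1.0 / k**0.6          # consensus weight, decaying more slowly
    x_new = np.empty_like(x)
    for n in range(N):
        noise = 0.01 * rng.normal(size=d)  # stochastic gradient noise xi_n
        consensus = sum(x[l] - x[n] for l in neighbors[n])
        x_new[n] = x[n] - alpha_k * (grad_fn(n, x[n]) + noise) + beta_k * consensus
    x = x_new

# For these quadratic losses, the minimizer of sum_n f_n is the mean of the
# targets; the agents should reach approximate consensus near that point.
consensus_point = targets.mean(axis=0)
```

The choice of a consensus weight $\beta_k$ that decays more slowly than the gradient step $\alpha_k$ mirrors the paper's two-time-scale setup, in which the averaging between agents dominates the local gradient drift asymptotically.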