Zero-Order One-Point Gradient Estimate in Consensus-Based Distributed Stochastic Optimization

Authors: Elissa Mhanna, Mohamad Assaad

TMLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we provide numerical examples to illustrate the performance of the algorithm 1P-DSG. We compare it with FO distributed methods aiming to achieve consensus, namely DSGT (Pu & Nedić, 2018) and EXTRA (Shi et al., 2015), a ZO distributed algorithm denoted 2P-DSG based on the two-point estimate in (2) (Tang et al., 2021), and a ZO centralized algorithm based on gradient descent (e.g. Flaxman et al. (2004) and Bach & Perchet (2016)) using another one-point estimate, which is presented in (1). We denote the ZO centralized algorithm by 1P-GD. ... We consider a logistic classification problem to classify m images of two digits, labeled as y_ij = +1 or -1, from the MNIST data set (Le Cun & Cortes, 2005).
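The one-point estimators referred to as (1) and (6) are not reproduced in this excerpt. As a rough illustration only, the classical Flaxman-style single-query estimator can be sketched as below; the function name, signature, and the scaled-Rademacher perturbation law are assumptions for this sketch, not the paper's exact construction:

```python
import numpy as np

def one_point_gradient(f, x, gamma, rng):
    """Classical one-point (single-query) zero-order gradient estimate.

    Generic Flaxman-style form, shown for illustration; the paper's
    estimators (1) and (6) may differ in scaling and perturbation law.
    """
    d = x.size
    # Perturbation direction: i.i.d. Rademacher entries scaled to unit norm
    phi = rng.choice([-1.0, 1.0], size=d) / np.sqrt(d)
    # One (possibly noisy) function query at the perturbed point
    return (d / gamma) * f(x + gamma * phi) * phi
```

For a linear function f(x) = c·x the estimate is unbiased (its mean over many perturbations recovers c), which is the basic property these estimators exploit.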
Researcher Affiliation | Academia | Elissa Mhanna (EMAIL), Mohamad Assaad (EMAIL), Laboratoire des Signaux & Systèmes, CentraleSupélec, 3 rue Joliot Curie, 91190 Gif-sur-Yvette, France
Pseudocode | Yes | 2.1 The 1P-DSG Algorithm. We consider a zero-order distributed stochastic gradient algorithm aiming for consensus with a one-point estimate. We denote it as 1P-DSG, employing the gradient estimate g_{i,k} in (6). Every agent i initializes its variables with an arbitrary vector x_{i,0} ∈ K and computes g_{i,0} at that variable. Then, at each time k ∈ N, agent i updates its variables independently according to the following steps: z_{i,k+1} = Σ_{j=1}^{n} w_{ij} (x_{j,k} - α_k g_{j,k}); x_{i,k+1} = Π_K(z_{i,k+1}); perform the action x_{i,k+1} + γ_{k+1} Φ_{i,k+1}, where α_k > 0 is a step size. Algorithm (7) can then be written in the following compact matrix form for clarity of analysis: z_{k+1} = W(x_k - α_k g_k); x_{k+1} = Π_K(z_{k+1}), with x_{k+1} = [x_{1,k+1}, x_{2,k+1}, . . . , x_{n,k+1}]^T; perform the action x_{k+1} + γ_{k+1} Φ_{k+1}, where Φ_k ∈ R^{n×d} is defined as Φ_k = [Φ_{1,k}, Φ_{2,k}, . . . , Φ_{n,k}]^T.
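The per-iteration update quoted above (consensus step on gradient-corrected iterates, projection, then a perturbed action) can be sketched in NumPy. All names here are hypothetical, and a Euclidean ball stands in for the paper's problem-specific constraint set K:

```python
import numpy as np

def project_ball(z, radius=1.0):
    """Euclidean projection onto a ball of given radius, standing in
    for Pi_K; the paper's constraint set K is problem-specific."""
    nrm = np.linalg.norm(z)
    return z if nrm <= radius else z * (radius / nrm)

def one_pdsg_step(X, G, W, alpha_k, gamma_k1, rng):
    """One round of the 1P-DSG update (sketch under stated assumptions).

    X : (n, d) stacked local iterates x_{i,k}
    G : (n, d) stacked one-point gradient estimates g_{i,k}
    W : (n, n) doubly stochastic consensus weight matrix
    """
    # Consensus on gradient-corrected iterates: z_{k+1} = W(x_k - alpha_k g_k)
    Z = W @ (X - alpha_k * G)
    # Row-wise projection onto the constraint set K
    X_new = np.vstack([project_ball(z) for z in Z])
    # Perturbed action actually played/queried by each agent
    n, d = X.shape
    Phi = rng.choice([-1.0, 1.0], size=(n, d)) / np.sqrt(d)
    actions = X_new + gamma_k1 * Phi
    return X_new, actions
```

With a full-averaging weight matrix W the consensus step collapses all local iterates to their mean in one round; a sparse doubly stochastic W tied to the communication graph only mixes neighbors, which is the setting the algorithm targets.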
Open Source Code | No | No explicit statement regarding the release of source code for this paper's methodology was found.
Open Datasets | Yes | We consider a logistic classification problem to classify m images of two digits, labeled as y_ij = +1 or -1, from the MNIST data set (Le Cun & Cortes, 2005).
Dataset Splits | Yes | In Figure 4, we measure at every iteration the classification accuracy against an independent test set of 2167 images, using the updated mean vector θ_k = (1/n) Σ_{i=1}^{n} θ_{i,k} of the local decision variables.
Hardware Specification | No | The paper does not provide specific details about the hardware used for running the experiments.
Software Dependencies | No | The paper does not provide specific software dependencies with version numbers.
Experiment Setup | Yes | We take σ_u = 0.01 and let α_k = 0.05(k+1)^{-0.75} and γ_k = 0.8(k+1)^{-0.25} for 1P-DSG with vanishing step sizes, and α = 0.05 and γ = 0.6 with constant step sizes. We choose Φ_k ∈ {±1/√d}^d with equal probability. Also, every function query is subject to a white noise generated by the standard normal distribution. For the DSGT algorithm, we set the step size to α_k = 0.015(k+1)^{-1} when it is vanishing and α = 0.015 when constant, and we consider neither the perturbation nor the noise on the objective function, only the noise on the exact gradient. Similarly for EXTRA, where we set the step size to α = 0.01. For 2P-DSG, we consider α_k = 0.01(k+1)^{-0.75} and γ_k = 0.01(k+1)^{-0.25}. For the centralized 1P-GD algorithm, we set α = 0.005 and γ = 0.5 (α = 0.03 and γ = 0.6 with estimator (6)). We let c = 0.1, and the initialization be the same for all algorithms, with θ_{i,0} uniformly chosen from [-0.5, 0.5]^d for all i ∈ N, per instance. We finally average the simulations over 30 instances.
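The vanishing step-size schedules reported for 1P-DSG are easy to reproduce exactly from the quoted constants; a minimal helper, assuming those defaults, could be:

```python
def vanishing_steps(k, a0=0.05, pa=0.75, g0=0.8, pg=0.25):
    """Vanishing step-size schedules reported for 1P-DSG:
    alpha_k = 0.05 (k+1)^{-0.75},  gamma_k = 0.8 (k+1)^{-0.25}.
    Defaults match the quoted setup; other algorithms in the
    comparison use their own constants and exponents."""
    alpha_k = a0 * (k + 1) ** (-pa)
    gamma_k = g0 * (k + 1) ** (-pg)
    return alpha_k, gamma_k
```

Both sequences decay polynomially, with the exploration radius γ_k shrinking more slowly than the step size α_k, the usual pattern for one-point zero-order schemes.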