Robust Distributed Accelerated Stochastic Gradient Methods for Multi-Agent Networks

Authors: Alireza Fallah, Mert Gürbüzbalaban, Asuman Ozdaglar, Umut Şimşekli, Lingjiong Zhu

JMLR 2022

Reproducibility Variable Result LLM Response
Research Type Experimental In this section, we conduct several experiments to validate our theory and assess the performance of D-SG and D-ASG. We consider a (regularized) logistic regression problem, a common formulation for binary classification tasks: min_x Σ_{i=1}^n log(1 + exp(−y_i X_i⊤ x)) + (λ/2)‖x‖_2^2, where (X_i, y_i) denotes a data pair: X_i ∈ R^d is the feature vector, y_i ∈ {−1, +1} denotes the label, and n denotes the number of data pairs. In all our experiments, we assume that each computation node has access to a subset of all data points, and the noisy gradient in (6) and (13) becomes the stochastic gradient computed on a random sub-sample of the data. More precisely, at each iteration, each computation node draws a random sub-sample from the data points it has access to, and computes the stochastic gradient using this sub-sample.
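The quoted objective and sub-sampled gradient can be sketched as follows. This is a minimal single-machine illustration, not the paper's MATLAB/C++ implementation; the function names and the `batch_prop` argument (the paper's batch proportion b) are hypothetical.

```python
import numpy as np

def logistic_loss(x, X, y, lam):
    """Regularized logistic loss: sum_i log(1 + exp(-y_i <X_i, x>)) + (lam/2)||x||^2."""
    margins = -y * (X @ x)
    # logaddexp(0, m) = log(1 + exp(m)), computed stably
    return np.sum(np.logaddexp(0.0, margins)) + 0.5 * lam * np.dot(x, x)

def stochastic_gradient(x, X, y, lam, batch_prop, rng):
    """Gradient estimate on a random sub-sample of proportion batch_prop (sketch)."""
    n = X.shape[0]
    b = max(1, int(batch_prop * n))
    idx = rng.choice(n, size=b, replace=False)
    Xb, yb = X[idx], y[idx]
    # sigmoid(-y <X, x>) weights; the n/b factor keeps the estimator unbiased
    # with respect to the full-sum objective above
    s = 1.0 / (1.0 + np.exp(yb * (Xb @ x)))
    return -(n / b) * (Xb.T @ (yb * s)) + lam * x
```

With `batch_prop = 1.0` the estimate reduces to the exact gradient of the loss, which makes the sketch easy to check against finite differences.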
Researcher Affiliation Academia Alireza Fallah* (EMAIL), Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA 02139, United States of America; Mert Gürbüzbalaban* (EMAIL), Department of Management Science and Information Systems, Departments of Electrical Engineering and Statistics, Rutgers University, Piscataway, NJ 08854, United States of America; Asuman Ozdaglar* (EMAIL), Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA 02139, United States of America; Umut Şimşekli* (EMAIL), INRIA, Département d'Informatique de l'École Normale Supérieure, PSL Research University, Paris, France; Lingjiong Zhu* (EMAIL), Department of Mathematics, Florida State University, Tallahassee, FL 32306, United States of America
Pseudocode Yes Algorithm 1: Distributed Multistage Accelerated Stochastic Gradient Algorithm (D-MASG). Input: initial iterate x(0), the sequence {α_t}_{t=1}^T of stepsizes, the sequence {k_t}_{t=1}^T of stage lengths. Set x(0, k_0) = x(0); for t = 1; t ≤ T; t = t + 1 do: Set x(t, −1) = x(t, 0) = x(t−1, k_{t−1}); for m = 0; m ≤ k_t − 1; m = m + 1 do: Set β_t = (1 − √(µα_t)) / (1 + √(µα_t)); Set y(t, m) = (1 + β_t) x(t, m) − β_t x(t, m−1); Set x(t, m+1) = W y(t, m) − α_t ∇̃f(y(t, m))
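The multistage update above can be sketched in a few lines. This is a single-machine simulation under stated assumptions: the N nodes are stacked as rows of a matrix, the mixing step W y is an ordinary matrix product, and `grad_f` returns one (possibly noisy) gradient row per node; the function name `d_masg` and this stacked-iterate form are illustrative, not the paper's implementation.

```python
import numpy as np

def d_masg(x0, grad_f, W, mu, alphas, stage_lens):
    """Sketch of D-MASG (Algorithm 1).
    x0: (N, d) matrix of per-node iterates; W: (N, N) doubly stochastic mixing
    matrix; alphas[t], stage_lens[t]: stepsize and length of stage t."""
    x_prev = x_curr = x0.copy()
    for t in range(len(alphas)):
        a = alphas[t]
        # beta_t depends only on the stage, so it is hoisted out of the inner loop
        beta = (1.0 - np.sqrt(mu * a)) / (1.0 + np.sqrt(mu * a))
        x_prev = x_curr  # x(t, -1) = x(t, 0) = last iterate of the previous stage
        for m in range(stage_lens[t]):
            y = (1.0 + beta) * x_curr - beta * x_prev        # momentum extrapolation
            x_prev, x_curr = x_curr, W @ y - a * grad_f(y)   # mix, then gradient step
    return x_curr
```

On a strongly convex quadratic with exact gradients and an averaging matrix W, the iterates of all nodes contract to the common minimizer, which gives a quick sanity check of the recursion.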
Open Source Code Yes In this set of experiments, we simulate the distributed environment in MATLAB and we provide our implementation in the supplementary material.
Open Datasets Yes The MNIST data set contains 70K binary images (of size d = 20 × 20) corresponding to 10 different digits.7 To obtain a binary classification problem, we extract the images corresponding to the digits 0 and 8, where we end up with n = 11774 images in total. On the other hand, the Epsilon data set is one of the standard binary classification data sets,8 and contains n = 400K samples with d = 2000. 7. http://yann.lecun.com/exdb/mnist. 8. https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html.
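The digit-extraction step the excerpt describes (keep classes 0 and 8, relabel as ±1) is a simple filter. A sketch, where `images` and `labels` stand in for MNIST arrays loaded elsewhere and the function name is hypothetical:

```python
import numpy as np

def extract_binary_task(images, labels, pos_digit=0, neg_digit=8):
    """Keep two digit classes, flatten the images, and relabel as +1 / -1."""
    mask = (labels == pos_digit) | (labels == neg_digit)
    X = images[mask].reshape(mask.sum(), -1).astype(np.float64)  # one row per image
    y = np.where(labels[mask] == pos_digit, 1.0, -1.0)
    return X, y
```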
Dataset Splits No The paper mentions a "batch proportion b = 0.1" used for computing stochastic gradients, but this refers to sub-sampling for gradient estimation, not explicit training/validation/test dataset splits. There is no mention of specific percentages, counts, or methodologies for partitioning the datasets into training, validation, or test sets.
Hardware Specification No We have implemented all the algorithms in C++ by using a low-level message passing protocol for parallel processing, namely the Open MPI library.9 In order to have a realistic experimental environment, we have conducted these experiments on a cluster of interconnected computers, each of which is equipped with CPUs and memory of varying quality.
Software Dependencies No In this set of experiments, we simulate the distributed environment in MATLAB and we provide our implementation in the supplementary material. We have implemented all the algorithms in C++ by using a low-level message passing protocol for parallel processing, namely the Open MPI library (https://www.open-mpi.org).9 The paper mentions MATLAB, C++, and the Open MPI library, but it does not specify version numbers for any of these software components.
Experiment Setup Yes Unless stated otherwise, we first generate n = 1000 data points, set the dimension d = 100, data variance σ²_X = 5, λ = 0.05, the number of nodes N = 10, the batch proportion b = 0.1, α = 0.1, and we consider the circular network architecture. In this problem, we first compute an estimate for L and µ from the data, where we obtain L ≈ 50 and µ ≈ 0.1. Then, we vary L from 5 to 500 by fixing µ = 0.1, and we vary µ from 0.01 to 1 by fixing L = 50. Accordingly, we run the algorithms with hyperparameters that are computed with these values for L and µ. For D-SG we set the stepsize α = (1 + λ_N^W)/L, whereas for D-ASG we set α = λ_N^W/L and β = (1 − √(αµ))/(1 + √(αµ)). Finally, for D-MASG we set κ = (L + µ)/(µλ_N^W) as in (38) and use the setting reported in Proposition 14.
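The hyperparameter formulas in the excerpt can be computed directly from W, L, and µ. In this sketch, λ_N^W is read as the smallest eigenvalue of the mixing matrix W (an assumption based on the excerpt's notation, which works only when that eigenvalue is positive); the function name is hypothetical.

```python
import numpy as np

def hyperparameters(L, mu, W):
    """Stepsizes per the excerpt's formulas; lam_W stands in for lambda_N^W,
    assumed here to be the smallest eigenvalue of the mixing matrix W."""
    lam_W = np.linalg.eigvalsh(W).min()
    alpha_dsg = (1.0 + lam_W) / L                    # D-SG stepsize
    alpha_dasg = lam_W / L                           # D-ASG stepsize (needs lam_W > 0)
    beta_dasg = (1.0 - np.sqrt(alpha_dasg * mu)) / (1.0 + np.sqrt(alpha_dasg * mu))
    kappa = (L + mu) / (mu * lam_W)                  # condition-type number for D-MASG
    return alpha_dsg, alpha_dasg, beta_dasg, kappa
```

For example, with the excerpt's estimates L ≈ 50 and µ ≈ 0.1 on a 3-node ring whose mixing matrix has smallest eigenvalue 0.25, this gives α = 0.025 for D-SG and α = 0.005 for D-ASG.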