Distributed Proximal Gradient Algorithm for Partially Asynchronous Computer Clusters

Authors: Yi Zhou, Yingbin Liang, Yaoliang Yu, Wei Dai, Eric P. Xing

JMLR 2018

Reproducibility Variable Result LLM Response
Research Type: Experimental — In this section, we empirically verify the convergence properties and time efficiency of m-PAPG. All data are generated via a normal distribution, with the columns normalized to have unit norm. We first test the convergence properties of m-PAPG via a non-convex Lasso problem with the group regularizer ℓ_{0,2}, which takes the form

    min_{x ∈ R^d} (1/2)‖Ax − b‖² + λ‖x‖_{0,2},    (73)

where we set the sample size n = 1000 and the dimension d = 2000, and the group norm divides the whole model into 20 groups of equal dimension. We use 4 machines (cores), each handling five groups of coordinates, and consider maximal staleness s = 0, 10, 20, 30, respectively.
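The setup quoted above can be sketched as follows. This is an editorial reconstruction, not the authors' code: Gaussian data with unit-norm columns, plus the proximal map of the ℓ_{0,2} group regularizer, which reduces to group-wise hard-thresholding. The function names, random seed, and the λ and η values are illustrative assumptions.

```python
import numpy as np

def make_data(n=1000, d=2000, seed=0):
    """Synthetic data as described: Gaussian entries, unit-norm columns."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((n, d))
    A /= np.linalg.norm(A, axis=0, keepdims=True)  # normalize each column
    b = rng.standard_normal(n)
    return A, b

def prox_group_l02(v, groups, lam, eta):
    """Prox of eta * lam * ||.||_{0,2}: group hard-thresholding.

    ||x||_{0,2} counts the nonzero groups, so a group is kept iff its
    Euclidean norm exceeds sqrt(2 * eta * lam): zeroing the group costs
    ||v_g||^2 / 2, while keeping it costs eta * lam.
    """
    x = np.zeros_like(v)
    thresh = np.sqrt(2.0 * eta * lam)
    for g in groups:
        if np.linalg.norm(v[g]) > thresh:
            x[g] = v[g]
    return x

# 20 equal-size groups of a d = 2000 model, as in the experiment.
d = 2000
groups = np.array_split(np.arange(d), 20)
```

With n = 1000 and d = 2000 this matches the reported problem sizes; the prox would be applied blockwise by each of the 4 machines on its five groups.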
Researcher Affiliation: Academia — Yi Zhou (EMAIL), Yingbin Liang (EMAIL), Department of Electrical and Computer Engineering, The Ohio State University; Yaoliang Yu (EMAIL), Department of Computer Science, University of Waterloo; Wei Dai (EMAIL), Eric P. Xing (EMAIL), Machine Learning Department, Carnegie Mellon University.
Pseudocode: Yes —

    Algorithm 1: Economic Implementation of m-PAPG
     1: For the server:
     2:   while receiving update Δ_i from machine i do
     3:     θ ← θ + Δ_i
     4:   end while
     5:   while machine i sends a pull request do
     6:     send θ to machine i
     7:   end while
     8: For machine i at active clock t ∈ T_i:
     9:   pull θ from the server
    10:   U_i ← prox_{η g_i}(x_i − η ∇_{A_i} f(θ)) − x_i
    11:   send Δ_i = A_i U_i to the server
    12:   update x_i ← x_i + U_i
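A minimal serial simulation of Algorithm 1's pull/update protocol, under simplifying assumptions of my own: f(x) = ½‖Ax − b‖², g_i ≡ 0 (so the prox step is the identity and U_i is a plain block gradient step), and zero staleness. This is an illustrative sketch, not the authors' Petuum implementation; a real run would let workers pull stale copies of θ concurrently.

```python
import numpy as np

def run_mpapg(A, b, blocks, eta, iters):
    """Serial simulation of the server/worker loop of Algorithm 1.

    A, b   -- data for f(x) = 0.5 * ||Ax - b||^2
    blocks -- coordinate index sets A_i, one per simulated machine
    """
    d = A.shape[1]
    theta = np.zeros(d)  # server's global parameter
    x = np.zeros(d)      # concatenation of the workers' local blocks x_i
    for _ in range(iters):
        for blk in blocks:                          # each "machine" in turn
            snap = theta.copy()                     # pull theta from server
            grad = A[:, blk].T @ (A @ snap - b)     # block gradient of f
            U = -eta * grad                         # prox step with g_i = 0
            theta[blk] += U                         # server applies delta_i
            x[blk] += U                             # local update x_i += U_i
    return theta
```

With g_i ≡ 0 and s = 0 the scheme reduces to block gradient descent, so on a well-conditioned problem θ converges to the least-squares solution.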
Open Source Code: No — The paper mentions using "Petuum (Ho et al., 2013; Dai et al., 2014), a stale synchronous parallel system," but does not provide a statement or link for the authors' own implementation code.
Open Datasets: No — All data are generated via a normal distribution, with the columns normalized to have unit norm. We first test the convergence properties of m-PAPG via a non-convex Lasso problem with the group regularizer ℓ_{0,2}, which takes the form min_{x ∈ R^d} (1/2)‖Ax − b‖² + λ‖x‖_{0,2} (73), where we set the sample size n = 1000 and the dimension d = 2000, and the group norm divides the whole model into 20 groups of equal dimension.
Dataset Splits: No — The paper describes generating synthetic data with specified sample sizes and dimensions (e.g., n = 1000, d = 2000, or n = 1 million, d = 100 million) but does not describe how this generated data is split into training, validation, or test sets.
Hardware Specification: Yes — We implement Algorithm 1 on Petuum (Ho et al., 2013; Dai et al., 2014), a stale synchronous parallel system which updates the local parameter caches via stale synchronous communications. The system contains 100 computing nodes, each equipped with 16 AMD Opteron processors and 16 GB RAM, linked by 1 Gbps Ethernet.
Software Dependencies: No — The paper mentions "Petuum (Ho et al., 2013; Dai et al., 2014), a stale synchronous parallel system," but does not specify a version number for Petuum or any other software dependency.
Experiment Setup: Yes — We set the learning rate to have the form η(αs) = 1/(L_f + 2Lαs), α > 0, i.e., a linear dependency on the staleness s, as suggested by Theorem 5. Then we run Algorithm 1 with different staleness values and use η(0), η(10), and η*(αs), respectively, where η*(αs) is the largest step size we tuned for each s that achieves stable convergence. We track the global model x(t) and plot the results in Figure 1. Next, we verify the time and communication efficiency of m-PAPG via an ℓ₁-regularized quadratic programming problem with very high dimensions, taking the form

    min_x (1/2) xᵀAᵀAx + λ‖x‖₁.    (74)

We generate samples of size n = 1 million and dimension d = 100 million. We implement Algorithm 1 on Petuum (Ho et al., 2013; Dai et al., 2014), a stale synchronous parallel system which updates the local parameter caches via stale synchronous communications. The system contains 100 computing nodes, each equipped with 16 AMD Opteron processors and 16 GB RAM, linked by 1 Gbps Ethernet. We fix the learning rate η = 10⁻³ and consider maximum staleness s = 0, 1, 3, 5, 7, respectively.
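The two ingredients of this setup can be sketched as follows: the staleness-dependent step size η(αs) = 1/(L_f + 2Lαs) quoted from Theorem 5, and the ℓ₁ proximal map (elementwise soft-thresholding) needed for problem (74). This is an illustrative sketch; the values of L_f, L, and α below are placeholders, not numbers from the paper.

```python
import numpy as np

def step_size(s, L_f, L, alpha):
    """Step size eta(alpha*s) = 1 / (L_f + 2*L*alpha*s).

    Larger staleness s forces a smaller step size; s = 0 recovers the
    usual proximal-gradient step 1 / L_f... up to the constants used.
    """
    return 1.0 / (L_f + 2.0 * L * alpha * s)

def prox_l1(v, t):
    """prox_{t * ||.||_1}(v): elementwise soft-thresholding."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)
```

Each proximal-gradient iteration on (74) would then be `x = prox_l1(x - eta * (A.T @ (A @ x)), eta * lam)` with `eta` chosen by `step_size` for the staleness in use.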