Redundancy Techniques for Straggler Mitigation in Distributed Optimization and Learning

Authors: Can Karakus, Yifan Sun, Suhas Diggavi, Wotao Yin

JMLR 2019

Reproducibility assessment (variable: result — LLM response):
Research Type: Experimental — "We implement the proposed technique on Amazon EC2 clusters, and demonstrate its performance over several learning problems, including matrix factorization, LASSO, ridge regression and logistic regression, and compare the proposed method with uncoded, asynchronous, and data replication strategies." Numerical results are reported.
Researcher Affiliation: Collaboration — Can Karakus (Amazon Web Services, East Palo Alto, CA 94303, USA); Yifan Sun (Department of Computer Science, University of British Columbia, Vancouver, BC, Canada); Suhas Diggavi (Department of Electrical and Computer Engineering, University of California, Los Angeles, Los Angeles, CA 90095, USA); Wotao Yin (Department of Mathematics, University of California, Los Angeles, Los Angeles, CA 90095, USA)
Pseudocode: Yes — Algorithm 1: Generic encoded distributed optimization procedure under data parallelism, at the master node; Algorithm 2: Generic encoded distributed optimization procedure under data parallelism, at worker node i; Algorithm 3: Encoded block coordinate descent at worker node i; Algorithm 4: Encoded block coordinate descent at the master node.
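The master/worker scheme of Algorithms 1 and 2 can be illustrated with a minimal simulation: the data is multiplied by a redundant encoding matrix S, the encoded rows are partitioned across m workers, and at each step the master aggregates gradients from only the k fastest workers, ignoring stragglers. The sketch below is a toy NumPy stand-in, not the authors' implementation: the Gaussian S, the least-squares objective, the step size, and the random choice of "fast" workers are all illustrative assumptions (the paper uses structured encoding matrices with S^T S ≈ I).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy least-squares problem min_w ||Xw - y||^2 standing in for the paper's setups.
n, d = 64, 8          # samples, features
m, k = 8, 6           # workers, number of fastest workers the master waits for
X = rng.standard_normal((n, d))
w_true = rng.standard_normal(d)
y = X @ w_true        # noiseless, so the encoded problem shares the minimizer

# Redundant encoding (hypothetical Gaussian code with redundancy factor beta = 2,
# scaled so that E[S^T S] = I).
beta = 2
S = rng.standard_normal((beta * n, n)) / np.sqrt(beta * n)
X_enc, y_enc = S @ X, S @ y

# Partition the encoded rows across the m workers.
blocks = np.array_split(np.arange(beta * n), m)

def worker_grad(w, rows):
    """Gradient of the local encoded least-squares term held by one worker."""
    Xi, yi = X_enc[rows], y_enc[rows]
    return 2 * Xi.T @ (Xi @ w - yi)

w = np.zeros(d)
step = 0.05 / n
for _ in range(500):
    # The master proceeds with the k fastest workers; "fastest" is simulated
    # here by a uniformly random subset, and the m - k stragglers are ignored.
    fast = rng.choice(m, size=k, replace=False)
    g = (m / k) * sum(worker_grad(w, blocks[i]) for i in fast)
    w -= step * g
```

Because the encoded residual S(Xw* − y) vanishes at the true minimizer, discarding stragglers still leaves an unbiased descent direction, which is the intuition the redundancy is meant to capture.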
Open Source Code: No — The paper provides no explicit statement of, or link to, open-source code for the described methodology; it states only: "We implement the proposed technique on Amazon EC2 clusters".
Open Datasets: Yes — "We next apply matrix factorization on the MovieLens-1M dataset (Riedl and Konstan (1998)) for the movie recommendation task." ... "In our next experiment, we apply logistic regression for document classification for Reuters Corpus Volume 1 (rcv1.binary) dataset from Lewis et al. (2004)"
Dataset Splits: Yes — "We withhold randomly 20% of these ratings to form an 80/20 train/test split." ... "reserve 100,000 documents for the test set."
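The quoted 80/20 split can be sketched as follows; the synthetic `ratings` array and seed are hypothetical stand-ins for the MovieLens-1M ratings.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical (user, movie, rating) triples standing in for MovieLens-1M,
# which holds ~1M ratings of 1-5 stars over 6040 users and 3952 movies.
n_ratings = 1000
ratings = np.column_stack([
    rng.integers(0, 6040, n_ratings),   # user id
    rng.integers(0, 3952, n_ratings),   # movie id
    rng.integers(1, 6, n_ratings),      # rating in 1..5
])

# Withhold a random 20% of the ratings to form the 80/20 train/test split.
perm = rng.permutation(n_ratings)
n_test = n_ratings // 5
test_idx, train_idx = perm[:n_test], perm[n_test:]
train, test = ratings[train_idx], ratings[test_idx]
```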
Hardware Specification: Yes — "We implement distributed L-BFGS as described in Section 3 on an Amazon EC2 cluster using mpi4py Python package, over m = 32 m1.small instances as worker nodes, and a single c3.8xlarge instance as the central server." ... "The MovieLens experiment is run on a single 32-core machine with Linux 4.4." ... "We implement the algorithm over 128 t2.medium worker nodes which collectively store the matrix X, and a c3.4xlarge master node." ... "We use m = 128 t2.medium instances as worker nodes, and a single c3.4xlarge instance as the master node" ... "we use compute-optimized, high-performance, high-bandwidth c4.4xlarge and c4.large instances available through Amazon EC2"
Software Dependencies: No — The paper mentions software such as the mpi4py Python package and numpy.linalg.solve, but it does not specify version numbers for these components or for any other libraries/frameworks used.
Experiment Setup: Yes — "We choose b = 3, p = 15, and λ = 10, which achieves test RMSE 0.861" ... "We choose λ = 0.6 and consider the sparsity recovery performance" ... "We use logistic regression with ℓ2-regularization for the classification task, with the objective min_{w,b} (1/n) Σ_{i=1}^{n} log(1 + exp(z_iᵀw + b)) + λ‖w‖²" ... "We compare the methods against each other by measuring the wall-clock time of optimization required to achieve a fixed mean-squared error." ... "we determine to be λ = 0.025. Then, we set the MSE bar as 1.05 MMSE." ... "In all cases, we use gradient descent with step size α_t = 0.2, which is run for 120 steps."
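The quoted ℓ2-regularized logistic regression objective and gradient-descent setup (step size 0.2, 120 steps) can be sketched as follows on synthetic data. The feature matrix, labels, and λ = 0.01 are illustrative assumptions (the paper tunes λ per task), and the loss is written in margin form with z_i = -y_i(x_iᵀw + b) so each term reads log(1 + exp(z_i)).

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic, nearly separable data standing in for rcv1.binary.
n, d = 200, 5
X = rng.standard_normal((n, d))
w_star = rng.standard_normal(d)
y = np.sign(X @ w_star + 0.1 * rng.standard_normal(n))  # labels in {-1, +1}

lam = 0.01  # illustrative regularization weight; the paper tunes it per task

def loss_and_grads(w, b):
    z = -y * (X @ w + b)                # margin form: loss term is log(1 + exp(z_i))
    p = np.exp(z - np.logaddexp(0, z))  # numerically stable sigmoid(z)
    loss = np.mean(np.logaddexp(0, z)) + lam * (w @ w)
    gw = -(X.T @ (y * p)) / n + 2 * lam * w
    gb = -np.mean(y * p)
    return loss, gw, gb

# Plain gradient descent with the quoted step size and iteration count.
w, b = np.zeros(d), 0.0
alpha = 0.2
for _ in range(120):
    loss, gw, gb = loss_and_grads(w, b)
    w -= alpha * gw
    b -= alpha * gb

loss, _, _ = loss_and_grads(w, b)           # final objective value
accuracy = np.mean(np.sign(X @ w + b) == y)  # training accuracy
```

Using `logaddexp` for both the loss and the sigmoid avoids overflow when margins grow large, which matters once the iterates start separating the data.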