Adaptive Compression for Communication-Efficient Distributed Training

Authors: Maksim Makarenko, Elnur Gasanov, Abdurakhmon Sadiev, Rustem Islamov, Peter Richtárik

TMLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We propose Adaptive Compressed Gradient Descent (AdaCGD), a novel optimization algorithm for communication-efficient training of supervised machine learning models with an adaptive compression level. ... In this work, we use a setup similar to that described in (Richtárik et al., 2022). Namely, we aim to solve the logistic regression problem with a nonconvex regularizer. ... Figure 1 compares the performance of our proposed algorithm, AdaCGD, with other popular 3PC methods.
Researcher Affiliation | Academia | Maksim Makarenko (EMAIL), King Abdullah University of Science and Technology; Elnur Gasanov (EMAIL), King Abdullah University of Science and Technology; Rustem Islamov (EMAIL), Institut Polytechnique de Paris; Abdurakhmon Sadiev (EMAIL), King Abdullah University of Science and Technology; Peter Richtárik (EMAIL), King Abdullah University of Science and Technology
Pseudocode | Yes | Algorithm 1: DCGD method with master compression
Open Source Code | No | The paper does not provide an explicit link to source code, nor does it state that the code will be made publicly available; it only mentions that simulations are implemented in Python.
Open Datasets | Yes | In training we use the LIBSVM (Chang & Lin, 2011) datasets phishing, a1a, and a9a.
Dataset Splits | Yes | Each dataset has been split into n = 20 equal parts, each representing a different client.
Hardware Specification | Yes | We implemented all simulations in Python 3.8 and ran them on a cluster of 48 nodes with Intel(R) Xeon(R) Gold 6230R CPUs.
Software Dependencies | Yes | All simulations are implemented in Python 3.8.
Experiment Setup | Yes | In training we use the LIBSVM (Chang & Lin, 2011) datasets phishing, a1a, and a9a. Each dataset has been split into n = 20 equal parts, each representing a different client. ... We fine-tuned the stepsize of each considered algorithm over a set of multiples of the corresponding theoretical stepsize, ranging from 2^0 to 2^8. ... We chose the Top-k operator as our compressor of choice. For EF21 and CLAG, we used the Top-1 compressor. ... For AdaCGD, we chose an array of compressors that varied from full compression (skipping communication) to zero compression (sending the full gradient), with a step of 5. We used the communication cost of the algorithm as the stopping criterion for all experiments.
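For reference, the logistic regression problem with a nonconvex regularizer used in this line of work (e.g., Richtárik et al., 2022) typically takes the following form; the exact regularizer shown here is an assumption, since the report only quotes the problem class:

```latex
\min_{x \in \mathbb{R}^d} \; f(x)
  = \frac{1}{m} \sum_{i=1}^{m} \log\!\left(1 + \exp\!\left(-b_i a_i^\top x\right)\right)
  + \lambda \sum_{j=1}^{d} \frac{x_j^2}{1 + x_j^2}
```

where (a_i, b_i) are training pairs with labels b_i in {-1, +1}, and the second term is a smooth nonconvex regularizer with strength lambda > 0.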
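The pseudocode row references Algorithm 1 (DCGD with compression). A minimal sketch of distributed compressed gradient descent with a Top-k sparsifier is given below; the helper names, toy quadratic losses, and the specific compressor are illustrative assumptions, not the paper's exact pseudocode.

```python
def top_k(v, k):
    """Top-k sparsifier: keep the k largest-magnitude entries, zero the rest."""
    idx = sorted(range(len(v)), key=lambda j: abs(v[j]), reverse=True)[:k]
    out = [0.0] * len(v)
    for j in idx:
        out[j] = v[j]
    return out

def dcgd(grads, x0, step, k, iters):
    """Sketch of distributed compressed gradient descent.

    grads: list of per-client gradient functions. Each client sends
    top_k(grad_i(x)) to the master, which averages the messages and
    takes a gradient step. Compression of the broadcast back to the
    clients is omitted for brevity.
    """
    x = list(x0)
    n = len(grads)
    d = len(x)
    for _ in range(iters):
        msgs = [top_k(g(x), k) for g in grads]               # client -> master
        avg = [sum(m[j] for m in msgs) / n for j in range(d)]
        x = [x[j] - step * avg[j] for j in range(d)]
    return x

# Toy usage: two clients with quadratic losses f_i(x) = ||x - c_i||^2 / 2.
c1, c2 = [1.0, 0.0, 3.0], [3.0, 0.0, 1.0]
g1 = lambda x: [x[j] - c1[j] for j in range(3)]
g2 = lambda x: [x[j] - c2[j] for j in range(3)]
x = dcgd([g1, g2], [0.0, 0.0, 0.0], step=0.5, k=2, iters=50)
# x converges toward the average minimizer [2.0, 0.0, 2.0].
```

With k = 2 each client drops its smallest-magnitude coordinate per round, yet the iterates still converge on this toy problem, which is the intuition behind communication-efficient sparsified methods.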
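The n = 20 client split described above can be sketched as follows; partitioning into equal contiguous shards (dropping any trailing remainder) is an assumption, since the report does not say how non-divisible dataset sizes are handled.

```python
def split_clients(samples, n_clients=20):
    """Partition a dataset into n_clients equal contiguous shards,
    one per simulated client; a trailing remainder is dropped so
    all shards have identical size."""
    size = len(samples) // n_clients
    return [samples[i * size:(i + 1) * size] for i in range(n_clients)]

# Usage: 100 samples split across 20 simulated clients, 5 samples each.
shards = split_clients(list(range(100)), n_clients=20)
```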
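The stepsize tuning described above (multiples of the theoretical stepsize over a power-of-two range) can be generated as below; treating the grid as exactly 2^0 through 2^8 is an assumption based on the tuning description.

```python
def stepsize_grid(theoretical, max_power=8):
    """Candidate stepsizes: the theoretical stepsize scaled by powers
    of two, 2**0 through 2**max_power (range assumed from the report)."""
    return [theoretical * 2 ** p for p in range(max_power + 1)]

# Usage: a hypothetical theoretical stepsize of 0.01 yields 9 candidates.
grid = stepsize_grid(0.01)
```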