Adaptive Compression for Communication-Efficient Distributed Training
Authors: Maksim Makarenko, Elnur Gasanov, Abdurakhmon Sadiev, Rustem Islamov, Peter Richtárik
TMLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We propose Adaptive Compressed Gradient Descent (AdaCGD), a novel optimization algorithm for communication-efficient training of supervised machine learning models with an adaptive compression level. ... In this work, we use a setup similar to the one described in (Richtárik et al., 2022). Namely, we aim to solve the logistic regression problem with a nonconvex regularizer: ... Figure 1 compares the performance of our proposed algorithm, AdaCGD, with other popular 3PC methods. |
| Researcher Affiliation | Academia | Maksim Makarenko (King Abdullah University of Science and Technology); Elnur Gasanov (King Abdullah University of Science and Technology); Rustem Islamov (Institut Polytechnique de Paris); Abdurakhmon Sadiev (King Abdullah University of Science and Technology); Peter Richtárik (King Abdullah University of Science and Technology) |
| Pseudocode | Yes | Algorithm 1 DCGD method with master compression |
| Open Source Code | No | The paper does not provide an explicit link to source code, nor does it state that the code will be made publicly available. It only mentions that simulations are implemented in Python. |
| Open Datasets | Yes | In training we use LIBSVM Chang & Lin (2011) datasets phishing, a1a, a9a. |
| Dataset Splits | Yes | Each dataset has been split into n = 20 equal parts, each representing a different client. |
| Hardware Specification | Yes | We implemented all simulations in Python 3.8, and ran them on a cluster of 48 nodes with Intel(R) Xeon(R) Gold 6230R CPUs. |
| Software Dependencies | Yes | All simulations are implemented in Python 3.8 |
| Experiment Setup | Yes | In training we use LIBSVM Chang & Lin (2011) datasets phishing, a1a, a9a. Each dataset has been split into n = 20 equal parts, each representing a different client. ... we fine-tuned the stepsize of each considered algorithm with a set of multiples of the corresponding theoretical stepsize, ranging from 2^0 to 2^8. ... we chose the Top-k operator as our compressor of choice. For EF21 and CLAG, we used the top-1 compressor ... For AdaCGD, we chose an array of compressors that varied from full compression (skip communication) to zero compression (sending the full gradient), with a step of 5. We used the communication cost of the algorithm as the stopping criterion for all experiments. |
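The setup above centers on the Top-k sparsifier applied inside a DCGD-style loop: each client compresses its local gradient before communicating, and the master averages the compressed messages. The following is a minimal sketch of such a step, assuming NumPy; the function names `top_k` and `dcgd_step` are ours, and the master-side compression from the paper's Algorithm 1 is omitted for brevity.

```python
import numpy as np

def top_k(x, k):
    """Top-k sparsifier: keep the k largest-magnitude entries of x, zero the rest."""
    if k >= x.size:
        return x.copy()
    out = np.zeros_like(x)
    idx = np.argpartition(np.abs(x), -k)[-k:]  # indices of the k largest |x_i|
    out[idx] = x[idx]
    return out

def dcgd_step(x, local_grads, k, stepsize):
    """One worker-compressed DCGD iteration: each client sends Top-k of its
    local gradient; the master averages the sparse messages and steps."""
    compressed = [top_k(g, k) for g in local_grads]
    avg = np.mean(compressed, axis=0)
    return x - stepsize * avg
```

With n = 20 clients as in the experiments, `local_grads` would hold 20 per-client gradients; using k = 1 reproduces the top-1 compressor quoted for EF21 and CLAG.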