Communication-Efficient Distributionally Robust Decentralized Learning
Authors: Matteo Zecchin, Marios Kountouris, David Gesbert
TMLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finally, we corroborate the theoretical findings with empirical results that highlight AD-GDA's ability to provide unbiased predictors and to greatly improve communication efficiency compared to existing distributionally robust algorithms. From Section 5 (Experiments): In this section, we empirically evaluate AD-GDA's capabilities in producing robust predictors. We first compare AD-GDA with CHOCO-SGD and showcase the merits of the distributionally robust procedure across different learning models, communication network topologies, and message compression schemes. |
| Researcher Affiliation | Academia | Matteo Zecchin EMAIL Communication Systems Department EURECOM, Sophia Antipolis, France Marios Kountouris EMAIL Communication Systems Department EURECOM, Sophia Antipolis, France David Gesbert EMAIL Communication Systems Department EURECOM, Sophia Antipolis, France |
| Pseudocode | Yes | Algorithm 1: Agnostic Decentralized GDA with Compressed Communication (AD-GDA).<br>Input: number of nodes m, number of iterations T, learning rates η_θ and η_λ, mixing matrix W, initial values θ^0 ∈ R^d and λ^0 ∈ Δ^{m−1}.<br>Output: θ_o = (1/T) Σ_{t=0}^{T−1} θ̄^t, λ_o = (1/T) Σ_{t=0}^{T−1} λ̄^t.<br>Initialize θ_i^0 = θ^0, λ_i^0 = λ^0, and s_i^0 = 0 for i = 1, …, m.<br>for t = 0, …, T−1 do (in parallel at each node i):<br>θ_i^{t+1/2} ← θ_i^t − η_θ ∇_θ g_i(θ_i^t, λ_i^t, ξ_i^t) // Descent step<br>λ_i^{t+1/2} ← P_Λ(λ_i^t + η_λ ∇_λ g_i(θ_i^t, λ_i^t, ξ_i^t)) // Projected ascent step<br>θ_i^{t+1} ← θ_i^{t+1/2} + γ(s_i^t − θ̂_i^t) // Gossip<br>q_i^t ← Q(θ_i^{t+1} − θ̂_i^t) // Compression<br>send (q_i^t, λ_i^{t+1/2}) to j ∈ N(i) and receive (q_j^t, λ_j^{t+1/2}) from j ∈ N(i) // Message exchange<br>θ̂_i^{t+1} ← θ̂_i^t + q_i^t // Public variables update<br>s_i^{t+1} ← s_i^t + Σ_{j=1}^m w_{i,j} q_j^t<br>λ_i^{t+1} ← Σ_{j=1}^m w_{i,j} λ_j^{t+1/2} // Dual variable averaging<br>end |
| Open Source Code | No | The paper does not provide any explicit statement about code release or a link to a code repository for the methodology described. It mentions a link to Open Review, which is a peer-review platform, not a code repository. |
| Open Datasets | Yes | mouse cell image classifier based on the Cells Out Of Sample 7-Class (COOS7) data set (Lu et al., 2019). We perform our experiments using the Fashion-MNIST data set (Xiao et al., 2017), a popular data set made of images of 10 different clothing items, which is commonly used to test distributionally robust learners (Mohri et al., 2019; Deng et al., 2021). A CIFAR-10 image classification task (Krizhevsky et al., 2009) based on 4-layer convolutional neural networks (CNN). |
| Dataset Splits | No | samples are partitioned across the network devices using a class-wise split. Namely, using a workstation equipped with a GTX 1080 Ti, we simulate a network of 10 nodes, each storing data points coming from one of the 10 classes. In this setting, we train a logistic regression model and a two-layer fully connected neural network with 25 hidden units to investigate both the convex and the non-convex cases. |
| Hardware Specification | Yes | using a workstation equipped with a GTX 1080 Ti |
| Software Dependencies | No | In both cases, we use the SGD optimizer and, to ensure consensus at the end of the optimization process, we consider a geometrically decreasing learning rate η_θ^t = r^t η_θ^0 with ratio r = 0.995 and initial value η_θ^0 = 1. The paper mentions the SGD optimizer but does not provide specific version numbers for any software or libraries used. |
| Experiment Setup | Yes | In both cases, we use the SGD optimizer and, to ensure consensus at the end of the optimization process, we consider a geometrically decreasing learning rate η_θ^t = r^t η_θ^0 with ratio r = 0.995 and initial value η_θ^0 = 1. For this experiment we track the worst-node loss of a logistic model trained using a fixed learning rate η_θ = 0.1. In Table 2 we report the average worst-case accuracy attained by the final averaged model θ̄^T. We consider a regularizer of the form χ²(λ) := Σ_i (λ_i − n_i/n)² / (n_i/n) and run AD-GDA for α = {10, 1, 0.01}. The batch sizes are the same across all algorithms and are set to {50, 50, 32} for the F-MNIST, CIFAR-10, and COOS7 experiments, respectively. All algorithms are run for T = 5000 iterations using an SGD optimizer and the same exponentially decaying learning rate schedule η_θ^t = r^t η_θ^0 with decay r = 0.998. For AD-GDA we consider a chi-squared regularizer with α = 0.01, while for DR-DSGD we set the KL regularizer parameter to α = 6 as in Issaid et al. (2022). |
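The pseudocode in the table above can be illustrated with a minimal NumPy sketch of the AD-GDA loop. This is not the authors' implementation: the quadratic local losses, uniform mixing matrix, top-k stand-in for the compression operator Q, and all hyperparameter values are illustrative assumptions, and the simplex projection is a standard Euclidean projection used in place of P_Λ.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection onto the probability simplex (stand-in for P_Lambda)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / (np.arange(len(v)) + 1) > 0)[0][-1]
    tau = (1.0 - css[rho]) / (rho + 1)
    return np.maximum(v + tau, 0.0)

def top_k(x, k):
    """Top-k sparsification as an illustrative compression operator Q."""
    out = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-k:]
    out[idx] = x[idx]
    return out

def ad_gda(grad_theta, grad_lam, W, d, T=300, eta_theta=0.1, eta_lam=0.05,
           gamma=0.3, k=2):
    m = W.shape[0]
    theta = np.zeros((m, d))            # private models theta_i
    lam = np.full((m, m), 1.0 / m)      # dual weights lambda_i, one simplex vector per node
    theta_hat = np.zeros((m, d))        # public (compressed) copies theta_hat_i
    s = np.zeros((m, d))                # gossip accumulators s_i
    theta_avg, lam_avg = np.zeros(d), np.zeros(m)
    for t in range(T):
        # descent step / projected ascent step
        gth = np.array([grad_theta(i, theta[i], lam[i]) for i in range(m)])
        glm = np.array([grad_lam(i, theta[i], lam[i]) for i in range(m)])
        theta_half = theta - eta_theta * gth
        lam_half = np.array([project_simplex(lam[i] + eta_lam * glm[i])
                             for i in range(m)])
        # gossip correction using public variables
        theta = theta_half + gamma * (s - theta_hat)
        # compression and message exchange
        q = np.array([top_k(theta[i] - theta_hat[i], k) for i in range(m)])
        theta_hat = theta_hat + q       # public variables update
        s = s + W @ q                   # accumulator update
        lam = W @ lam_half              # dual variable averaging
        theta_avg += theta.mean(axis=0) / T
        lam_avg += lam.mean(axis=0) / T
    return theta_avg, lam_avg

# toy instance: m quadratic local losses f_i(theta) = 0.5 * ||theta - c_i||^2
m, d = 4, 3
rng = np.random.default_rng(0)
c = rng.normal(size=(m, d))
W = np.full((m, m), 1.0 / m)            # fully connected, doubly stochastic mixing

def grad_theta(i, theta_i, lam_i):
    # gradient of the lambda-weighted local loss seen at node i
    return m * lam_i[i] * (theta_i - c[i])

def grad_lam(i, theta_i, lam_i):
    # node i only observes its own loss value
    g = np.zeros(m)
    g[i] = 0.5 * np.sum((theta_i - c[i]) ** 2)
    return g

theta_bar, lam_bar = ad_gda(grad_theta, grad_lam, W, d)
```

Note that nodes never exchange the dense model difference, only its compressed version q_i^t, which is where the communication savings of the compressed-gossip design come from.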