Communication-Efficient Distributionally Robust Decentralized Learning

Authors: Matteo Zecchin, Marios Kountouris, David Gesbert

TMLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Finally, we corroborate the theoretical findings with empirical results that highlight AD-GDA's ability to provide unbiased predictors and to greatly improve communication efficiency compared to existing distributionally robust algorithms. (Section 5, Experiments:) In this section, we empirically evaluate AD-GDA's capabilities in producing robust predictors. We first compare AD-GDA with CHOCO-SGD and showcase the merits of the distributionally robust procedure across different learning models, communication network topologies, and message compression schemes.
Researcher Affiliation | Academia | Matteo Zecchin, Marios Kountouris, David Gesbert; Communication Systems Department, EURECOM, Sophia Antipolis, France.
Pseudocode | Yes | Algorithm 1: Agnostic Decentralized GDA with Compressed Communication (AD-GDA)
Input: number of nodes m, number of iterations T, learning rates η_θ and η_λ, mixing matrix W, initial values θ^0 ∈ R^d and λ^0 ∈ Δ_m
Output: θ_o = (1/T) Σ_{t=0}^{T-1} θ^t, λ_o = (1/T) Σ_{t=0}^{T-1} λ^t
Initialize θ_i^0 = θ^0, λ_i^0 = λ^0 and s_i^0 = 0 for i = 1, ..., m
for t = 0, ..., T-1 do  // in parallel at each node i
    θ_i^{t+1/2} ← θ_i^t − η_θ ∇_θ g_i(θ_i^t, λ_i^t, ξ_i^t)              // descent step
    λ_i^{t+1/2} ← P_Λ(λ_i^t + η_λ ∇_λ g_i(θ_i^t, λ_i^t, ξ_i^t))         // projected ascent step
    θ_i^{t+1} ← θ_i^{t+1/2} + γ (s_i^t − θ̂_i^t)                         // gossip
    q_i^t ← Q(θ_i^{t+1} − θ̂_i^t)                                        // compression
    send (q_i^t, λ_i^{t+1/2}) to j ∈ N(i) and receive (q_j^t, λ_j^{t+1/2}) from j ∈ N(i)  // message exchange
    θ̂_i^{t+1} ← θ̂_i^t + q_i^t                                          // public variable update
    s_i^{t+1} ← s_i^t + Σ_{j=1}^m w_{i,j} q_j^t
    λ_i^{t+1} ← Σ_{j=1}^m w_{i,j} λ_j^{t+1/2}                           // dual variable averaging
end
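The per-round updates above can be sketched in NumPy. This is an illustrative sketch, not the authors' implementation: the top-k compressor standing in for Q, the simplex projection P_Λ, and the gradient arrays are hypothetical stand-ins, and messages are "exchanged" via shared arrays rather than an actual network.

```python
import numpy as np

def top_k(v, k):
    """Hypothetical sparsifying compressor Q: keep the k largest-magnitude entries."""
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]
    out[idx] = v[idx]
    return out

def project_simplex(lam):
    """Euclidean projection onto the probability simplex (the set Λ)."""
    u = np.sort(lam)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(lam) + 1) > (css - 1.0))[0][-1]
    tau = (css[rho] - 1.0) / (rho + 1)
    return np.maximum(lam - tau, 0.0)

def ad_gda_step(theta, theta_hat, s, lam, grads_theta, grads_lam,
                W, eta_theta, eta_lam, gamma, k):
    """One AD-GDA round for all m nodes at once (row i holds node i's state)."""
    m = theta.shape[0]
    # Local descent step on theta and projected ascent step on lambda
    theta_half = theta - eta_theta * grads_theta
    lam_half = np.array([project_simplex(lam[i] + eta_lam * grads_lam[i])
                         for i in range(m)])
    # Gossip correction using the public variables
    theta_new = theta_half + gamma * (s - theta_hat)
    # Compress the difference to the public copy (the messages q_i)
    q = np.array([top_k(theta_new[i] - theta_hat[i], k) for i in range(m)])
    # Public variable update and aggregation of compressed neighbor messages
    theta_hat_new = theta_hat + q
    s_new = s + W @ q
    # Dual variable averaging over neighbors
    lam_new = W @ lam_half
    return theta_new, theta_hat_new, s_new, lam_new
```

With a doubly stochastic mixing matrix W, the dual averaging step keeps each node's λ_i on the simplex, which the test below checks on a toy 3-node instance.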
Open Source Code | No | The paper does not provide any explicit statement about code release or a link to a code repository for the methodology described. It mentions a link to OpenReview, which is a peer-review platform, not a code repository.
Open Datasets | Yes | A mouse cell image classifier based on the Cells Out Of Sample 7-Class (COOS7) data set (Lu et al., 2019). We perform our experiments using the Fashion-MNIST data set (Xiao et al., 2017), a popular data set made of images of 10 different clothing items, which is commonly used to test distributionally robust learners (Mohri et al., 2019; Deng et al., 2021). A CIFAR-10 image classification task based on 4-layer convolutional neural networks (CNN) (Krizhevsky et al., 2009).
Dataset Splits | No | Samples are partitioned across the network devices using a class-wise split. Namely, using a workstation equipped with a GTX 1080 Ti, we simulate a network of 10 nodes, each storing data points coming from one of the 10 classes. In this setting, we train a logistic regression model and a two-layer fully connected neural network with 25 hidden units to investigate both the convex and the non-convex cases.
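The class-wise split described in this row (node i stores exactly the samples of class i) can be expressed as a short helper; the function name is illustrative and dataset loading is left out:

```python
import numpy as np

def class_wise_split(labels, num_nodes=10):
    """Map each node to the indices of the samples it stores.

    Assumes num_nodes equals the number of classes (10 for Fashion-MNIST),
    so node i receives all samples whose label is i.
    """
    labels = np.asarray(labels)
    return {i: np.flatnonzero(labels == i) for i in range(num_nodes)}

# Toy example: 12 samples over 3 classes split across 3 nodes
split = class_wise_split([0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2], num_nodes=3)
```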
Hardware Specification | Yes | Using a workstation equipped with a GTX 1080 Ti.
Software Dependencies | No | In both cases, we use the SGD optimizer and, to ensure consensus at the end of the optimization process, we consider a geometrically decreasing learning rate η_θ^t = r^t η_θ^0 with ratio r = 0.995 and initial value η_θ^0 = 1. The paper mentions the SGD optimizer but does not provide specific version numbers for any software or libraries used.
Experiment Setup | Yes | In both cases, we use the SGD optimizer and, to ensure consensus at the end of the optimization process, we consider a geometrically decreasing learning rate η_θ^t = r^t η_θ^0 with ratio r = 0.995 and initial value η_θ^0 = 1. For this experiment we track the worst-node loss of a logistic model trained using a fixed learning rate η_θ = 0.1. In Table 2 we report the average worst-case accuracy attained by the final averaged model θ^T. We consider a regularizer of the form χ²(λ) := Σ_i (λ_i − n_i/n)² / (n_i/n) and run AD-GDA for α ∈ {10, 1, 0.01}. The batch sizes are the same across all algorithms and are set to {50, 50, 32} for the F-MNIST, CIFAR-10, and COOS7 experiments, respectively. All algorithms are run for T = 5000 iterations using an SGD optimizer and the same exponentially decaying learning rate schedule η_θ^t = r^t η_θ^0 with decay r = 0.998. For AD-GDA we consider a chi-squared regularizer with α = 0.01, while for DR-DSGD we set the KL regularizer parameter to α = 6 as in Issaid et al. (2022).
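The two quantities quoted in this row, the geometric learning-rate schedule η_θ^t = r^t η_θ^0 and the chi-squared regularizer χ²(λ) = Σ_i (λ_i − n_i/n)² / (n_i/n), translate directly into code. A minimal sketch, assuming the function names and the per-node sample counts `n_per_node` are illustrative:

```python
import numpy as np

def lr_schedule(t, eta0=1.0, r=0.995):
    """Geometrically decreasing learning rate: eta_t = r**t * eta0."""
    return (r ** t) * eta0

def chi2_regularizer(lam, n_per_node):
    """Chi-squared penalty between lam and the empirical proportions n_i / n."""
    p = np.asarray(n_per_node, dtype=float)
    p = p / p.sum()                      # empirical proportions n_i / n
    lam = np.asarray(lam, dtype=float)
    return float(np.sum((lam - p) ** 2 / p))
```

The penalty vanishes when λ matches the empirical data proportions and grows as λ concentrates on a few nodes, which is how small α values push AD-GDA toward more pessimistic (worst-node) objectives.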