Gradient Masked Averaging for Federated Learning

Authors: Irene Tenison, Sai Aravind Sreeramadas, Vaikkunth Mugunthan, Edouard Oyallon, Irina Rish, Eugene Belilovsky

TMLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We perform extensive experiments on multiple FL algorithms with in-distribution, real-world, feature-skewed out-of-distribution, and quantity-imbalanced datasets and show that it provides consistent improvements, particularly in the case of heterogeneous clients. In this section, we show empirically that the proposed GMA tends to outperform standard aggregation, converging at a similar or better rate than standard aggregation, while enhancing the global model's generalization.
Researcher Affiliation | Academia | (1) Massachusetts Institute of Technology, MA, USA; (2) Université de Montréal, Quebec, Canada; (3) Mila, Quebec AI Institute, Canada; (4) Sorbonne University, Paris, France; (5) Concordia University, Quebec, Canada
Pseudocode | Yes | Algorithm 1: Gradient Masked FedAvg (McMahan et al., 2017). Server executes: initialize w_0 ∈ R^d randomly; for each server epoch t = 1, 2, 3, ..., T do ...
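The quoted pseudocode is truncated; the masked-averaging step it refers to can be sketched as follows. This is a minimal sketch assuming a per-parameter sign-agreement mask with threshold τ; the function name and the soft-mask fallback to τ are illustrative assumptions, not taken verbatim from Algorithm 1.

```python
import numpy as np

def gradient_masked_average(client_updates, tau=0.4):
    """Sketch of gradient-masked aggregation (simplified from Algorithm 1).

    client_updates: list of 1-D parameter-update arrays, one per client.
    tau: sign-agreement threshold (the paper fixes tau = 0.4).
    """
    updates = np.stack(client_updates)        # shape (K clients, d params)
    avg = updates.mean(axis=0)                # plain FedAvg update
    # Per-parameter agreement: |mean of client update signs|, in [0, 1].
    agreement = np.abs(np.sign(updates).mean(axis=0))
    # Keep parameters where enough clients agree on the sign; down-weight
    # the rest to tau (soft mask -- an assumption of this sketch).
    mask = np.where(agreement >= tau, 1.0, tau)
    return mask * avg
```

With τ = 0.4 (the value maintained throughout the paper's experiments), parameters on which clients disagree in sign contribute less to the global update than parameters with consistent signs across clients.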
Open Source Code | Yes | Code for our experiments is included in the supplementary materials and will be made available at the time of publication. The code is available at https://github.com/arvi797/FL.
Open Datasets | Yes | Details of the datasets explored and the respective skews induced are summarised in Table 1. The datasets mentioned include MNIST, FMNIST, CIFAR-10, CIFAR-100, Tiny ImageNet, FEMNIST, and FedCMNIST.
Dataset Splits | Yes | Our experiments include label distribution skew, feature distribution skew, quantity skew, real-world data distribution, and mixed (label and feature) skew. The test data consists of data from the same domain as the train dataset distributed across clients, but from one user (or a set of users) not included in the set of train clients. To simulate this quantity imbalance, we used a Dirichlet-distribution-based quantity skew with β = 0.5, as in Li et al. (2021a), on CIFAR-10 across 100 and 10 clients, with 10 clients participating in each communication round.
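The Dirichlet-based quantity skew described above can be sketched as follows. This is a hedged sketch: the helper name and the exact index-partitioning scheme are assumptions; only the general recipe of drawing client proportions from Dir(β) with β = 0.5 comes from the quoted text.

```python
import numpy as np

def dirichlet_quantity_split(n_samples, n_clients, beta=0.5, seed=0):
    """Sketch of a Dirichlet-based quantity skew (in the spirit of Li et al., 2021a).

    Draws per-client dataset-size proportions from Dir(beta) and partitions
    a shuffled index array accordingly; beta = 0.5 matches the quoted setting.
    """
    rng = np.random.default_rng(seed)
    proportions = rng.dirichlet(np.full(n_clients, beta))
    indices = rng.permutation(n_samples)
    # Convert cumulative proportions to split points over the index array.
    split_points = (np.cumsum(proportions)[:-1] * n_samples).astype(int)
    return np.split(indices, split_points)
```

Smaller β makes client dataset sizes more imbalanced; every sample is assigned to exactly one client, so the shards form a partition of the dataset.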
Hardware Specification | No | No specific hardware details (such as GPU/CPU models or memory) are provided in the paper. The paper only refers to clients and servers in a general federated learning context.
Software Dependencies | No | An SGD optimizer with momentum (ρ = 0.9) and cross-entropy loss was used to train each client before aggregation at the server in all our experiments. The momentum parameters of the adaptive federated optimizers are fixed at β1 = 0.9 and β2 = 0.99 as per Reddi et al. (2021). For our experiments, we used CIFAR-10 and ResNet models. No specific software versions for libraries or programming languages are mentioned.
Experiment Setup | Yes | An SGD optimizer with momentum (ρ = 0.9) and cross-entropy loss was used to train each client before aggregation at the server in all our experiments. The momentum parameters of the adaptive federated optimizers are fixed at β1 = 0.9 and β2 = 0.99 as per Reddi et al. (2021). For each of the considered algorithms we tune the local client learning rate, global learning rate, τ, and number of local epochs (CIFAR and Tiny ImageNet) to obtain the best performance of each algorithm. More details on the hyperparameter tuning are given in the Appendix. The grid was fixed the same for all datasets and algorithms: η_l ∈ {10^-3, 10^-2, 5·10^-2, 10^-1}, η_g ∈ {10^-2, 10^-1, 1, 1.5, 2}, τ ∈ {0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9}. Throughout the experiments we maintained τ = 0.4.
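The quoted hyperparameter grid can be enumerated with a short sketch. The dictionary layout and helper name are illustrative assumptions; only the value lists for η_l, η_g, and τ come from the quoted text.

```python
import itertools

# Hyperparameter grid quoted above: local lr (eta_l), global lr (eta_g),
# and the masking threshold tau.
GRID = {
    "local_lr":  [1e-3, 1e-2, 5e-2, 1e-1],
    "global_lr": [1e-2, 1e-1, 1.0, 1.5, 2.0],
    "tau":       [round(0.1 * i, 1) for i in range(1, 10)],
}

def grid_configs(grid=GRID):
    """Yield every combination of grid values as a config dict."""
    keys = list(grid)
    for values in itertools.product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))
```

The full sweep is 4 × 5 × 9 = 180 configurations per algorithm and dataset; the paper then fixes τ = 0.4 for its reported experiments.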