Understanding the Role of Layer Normalization in Label-Skewed Federated Learning
Authors: Guojun Zhang, Mahdi Beitollahi, Alex Bie, Xi Chen
TMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, we show that normalization leads to drastic improvements on standard benchmarks under extreme label shift. Moreover, we conduct extensive ablation studies to understand the critical factors of layer normalization in FL. Our results verify that FN is an essential ingredient inside LN to significantly improve the convergence of FL while remaining robust to learning rate choices, especially under extreme label shift where each client has access to few classes. |
| Researcher Affiliation | Industry | Guojun Zhang, Huawei Noah's Ark Lab |
| Pseudocode | No | The paper describes methods and theoretical derivations but does not present any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code is available at https://github.com/huawei-noah/Federated-Learning/tree/main/Layer_Normalization. |
| Open Datasets | Yes | We test the comparison on several common datasets including CIFAR-10, CIFAR-100 (Krizhevsky et al., 2009) and Tiny ImageNet (Le & Yang, 2015) with CNN and ResNet-18 (He et al., 2016). For the PACS dataset, to add distribution shift to the clients, we first split the dataset into 4 groups: photo (P), art painting (A), clipart (C) and sketch (S). |
| Dataset Splits | Yes | To simulate the label shift problem, we create two types of data partitioning. One is n class(es) partitioning, where each client only has access to data from n class(es). The other is Dirichlet partitioning from Hsu et al. (2019). Our setup includes 10 clients for the CIFAR-10 dataset, 50 and 20 clients for the CIFAR-100 dataset, 200 clients for Tiny ImageNet, and 12 clients for PACS. More specifically, we use one class per client, two classes per client, and Dirichlet allocation with β = 0.1 for the CIFAR-10 dataset as shown in Figure 8. For the CIFAR-100 dataset, we use two classes per client (50 clients), 5 classes per client (20 clients), and Dirichlet allocation with β = 0.1 (20 clients). For Tiny ImageNet, we apply Dirichlet allocation with β = 0.1, 0.2, 0.5. Finally, for the PACS dataset, to add distribution shift to the clients, we first split the dataset into 4 groups: photo (P), art painting (A), clipart (C) and sketch (S). For each group, we split the data into 3 clients, with Dirichlet partitioning Dir(0.5) and Dir(1.0), as well as partitioning into disjoint sets of classes (two clients have two classes each, and the other client has three classes of samples). Statistics on the number of clients and examples in both the training and test splits of the datasets are given in Table 8. |
| Hardware Specification | Yes | Our experiments are run on a cluster of 12 GPUs, including NVIDIA V100 and P100. |
| Software Dependencies | No | The paper mentions using SGD as an optimizer and various models (CNN, ResNet), but it does not specify any software libraries or dependencies with version numbers. |
| Experiment Setup | Yes | We use SGD with a 0.01 learning rate (lr) and batch size of 32 for all of the experiments, except for the E = 1 experiments on CIFAR-100, where we take lr = 0.1, and for PACS, where we take lr = 0.001. We use SGD with a momentum of 0.9 only for our centralized training baseline. In each client, we take E steps of local training, iterating over E batches of data per client. We use online augmentation with random horizontal flip and random cropping with padding of 4 pixels for all of the datasets. Moreover, we test and utilize FedYogi (Reddi et al., 2020) as a server adaptive optimizer method in combination with FN. All the reported experiments are done with 10,000 global rounds. Our hyperparameter choices are summarized in Table 9. |
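The Dirichlet partitioning scheme of Hsu et al. (2019) referenced in the Dataset Splits row can be sketched as follows. This is a minimal NumPy illustration, not the authors' released code: for each class, client proportions are drawn from Dir(β) and that class's samples are split accordingly, so smaller β (e.g., the paper's β = 0.1) yields more extreme label skew. The function name and signature here are hypothetical.

```python
import numpy as np

def dirichlet_partition(labels, n_clients, beta, seed=0):
    """Label-skewed Dirichlet partitioning (Hsu et al., 2019 style).

    For each class, draw per-client proportions from Dir(beta) and
    split that class's sample indices across clients accordingly.
    Smaller beta concentrates each class on fewer clients.
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    client_indices = [[] for _ in range(n_clients)]
    for c in np.unique(labels):
        # Shuffle this class's sample indices before splitting.
        idx = rng.permutation(np.flatnonzero(labels == c))
        # Per-client share of class c, drawn from Dir(beta, ..., beta).
        props = rng.dirichlet([beta] * n_clients)
        # Convert cumulative proportions into split points.
        cuts = (np.cumsum(props)[:-1] * len(idx)).astype(int)
        for client, shard in enumerate(np.split(idx, cuts)):
            client_indices[client].extend(shard.tolist())
    return client_indices
```

Every sample index is assigned to exactly one client, so the shards form a disjoint cover of the dataset; sweeping β (0.1, 0.2, 0.5 in the paper's Tiny ImageNet runs) controls how far each client's label distribution departs from uniform.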