Understanding the Role of Layer Normalization in Label-Skewed Federated Learning

Authors: Guojun Zhang, Mahdi Beitollahi, Alex Bie, Xi Chen

TMLR 2024

Reproducibility Variable Result LLM Response
Research Type Experimental Empirically, we show that normalization leads to drastic improvements on standard benchmarks under extreme label shift. Moreover, we conduct extensive ablation studies to understand the critical factors of layer normalization in FL. Our results verify that FN is an essential ingredient inside LN to significantly improve the convergence of FL while remaining robust to learning rate choices, especially under extreme label shift where each client has access to few classes.
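The row above quotes the paper's claim that feature normalization (FN) is the essential ingredient inside layer normalization (LN). As an illustrative sketch only: standard LN subtracts the per-sample mean and divides by the standard deviation, while FN can be read as the scale-only part, dividing a feature vector by its norm without mean subtraction. The exact FN definition is the paper's; this minimal pure-Python version is one plausible reading, not the authors' implementation.

```python
import math

def layer_norm(x, eps=1e-5):
    """Standard layer normalization of one feature vector:
    subtract the mean, then divide by the standard deviation."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def feature_norm(x, eps=1e-5):
    """Illustrative feature normalization (FN): rescale by the L2 norm
    only, with no mean subtraction -- the ingredient the row highlights.
    This reading of FN is an assumption; see the paper for the exact form."""
    norm = math.sqrt(sum(v * v for v in x) + eps)
    return [v / norm for v in x]
```

After `layer_norm`, a vector has (approximately) zero mean and unit variance; after `feature_norm`, it has (approximately) unit L2 norm but keeps its mean direction.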
Researcher Affiliation Industry Guojun Zhang EMAIL Huawei Noah's Ark Lab
Pseudocode No The paper describes methods and theoretical derivations but does not present any structured pseudocode or algorithm blocks.
Open Source Code Yes Our code is available at https://github.com/huawei-noah/Federated-Learning/tree/main/Layer_Normalization.
Open Datasets Yes We test the comparison on several common datasets including CIFAR-10, CIFAR-100 (Krizhevsky et al., 2009) and Tiny ImageNet (Le & Yang, 2015) with CNN and ResNet-18 (He et al., 2016). For the PACS dataset, to add distribution shift to the clients, we first split the dataset into 4 groups, photo (P), art painting (A), clipart (C) and sketch (S).
Dataset Splits Yes To simulate the label shift problem, we create two types of data partitioning. One is n class(es) partitioning where each client has only access to data from n class(es). Another is Dirichlet partitioning from Hsu et al. (2019). Our setup includes 10 clients for the CIFAR-10 dataset, 50 and 20 clients for the CIFAR-100 dataset, 200 clients for Tiny ImageNet, and 12 clients for PACS. More specifically, we use one class per client, two classes per client, and Dirichlet allocation with β = 0.1 for the CIFAR-10 dataset as shown in Figure 8. For the CIFAR-100 dataset, we use two classes per client (50 clients), 5 classes per client (20 clients), and Dirichlet allocation with β = 0.1 (20 clients). For Tiny ImageNet, we apply Dirichlet allocation with β = 0.1, 0.2, 0.5. Finally, for the PACS dataset, to add distribution shift to the clients, we first split the dataset into 4 groups, photo (P), art painting (A), clipart (C) and sketch (S). For each group, we split the data into 3 clients, with Dirichlet partitioning Dir(0.5) and Dir(1.0), as well as partitioning into disjoint sets of classes (two clients have two classes each, and the other client has three classes of samples). Statistics on the number of clients and examples in both the training and test splits of the datasets are given in Table 8.
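The Dirichlet partitioning the row describes (Hsu et al., 2019) draws, for each class, client proportions from Dir(β) and allocates that class's samples accordingly; small β (e.g. 0.1) yields extreme label skew. A minimal stdlib-only sketch of the scheme under that reading (function name and details are illustrative, not the paper's code):

```python
import random

def dirichlet_partition(labels, num_clients, beta, seed=0):
    """Split sample indices across clients with per-class proportions
    drawn from Dir(beta), in the style of Hsu et al. (2019).
    Dirichlet draws are built from normalized Gamma(beta, 1) variates."""
    rng = random.Random(seed)
    clients = [[] for _ in range(num_clients)]
    for c in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == c]
        rng.shuffle(idx)
        weights = [rng.gammavariate(beta, 1.0) for _ in range(num_clients)]
        total = sum(weights)
        props = [w / total for w in weights]
        # Cut this class's samples at the cumulative proportion boundaries.
        cum, start = 0.0, 0
        for k in range(num_clients):
            cum += props[k]
            end = len(idx) if k == num_clients - 1 else int(cum * len(idx))
            clients[k].extend(idx[start:end])
            start = end
    return clients
```

With β = 0.1 most of each class's mass lands on a few clients, reproducing the "each client sees few classes" regime the paper studies; larger β approaches a uniform split.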
Hardware Specification Yes Our experiments are run on a cluster of 12 GPUs, including NVIDIA V100 and P100.
Software Dependencies No The paper mentions using SGD as an optimizer and various models (CNN, ResNet), but it does not specify any software libraries or dependencies with version numbers.
Experiment Setup Yes We use SGD with a 0.01 learning rate (lr) and batch size of 32 for all of the experiments except for E = 1 experiments in CIFAR-100 in which we take lr = 0.1 as the learning rate and lr = 0.001 for PACS. We use SGD with a momentum of 0.9 only for our centralized training baseline. In each client, we take E steps of local training in which we iterate E batches of data per client. We use online augmentation with random horizontal flip and random cropping with padding of 4 pixels for all of the datasets. Moreover, we test and utilize FedYogi (Reddi et al., 2020) as a server adaptive optimizer method in combination with FN. All the reported experiments are done with 10,000 global rounds. Our hyperparameter choices are summarized in Table 9.
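The client-side procedure quoted above (E local SGD steps, one batch per step, default lr = 0.01) can be sketched as follows. This is a generic FedAvg-style local update on a flat parameter list, not the authors' code; `grad_fn` is a hypothetical stand-in for the model's gradient computation.

```python
def local_update(weights, batches, grad_fn, lr=0.01, E=1):
    """One round of client-side training: E plain-SGD steps, consuming
    one batch per step (cycling if E exceeds the number of batches).
    lr and E mirror the defaults quoted in the setup row."""
    w = list(weights)  # copy so the global model is untouched
    for step in range(E):
        batch = batches[step % len(batches)]
        grad = grad_fn(w, batch)
        w = [wi - lr * gi for wi, gi in zip(w, grad)]
    return w
```

In the paper's protocol, the server would average the returned `w` across clients each round (10,000 rounds total), optionally replacing plain averaging with an adaptive server optimizer such as FedYogi.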