Adaptive Self-Distillation for Minimizing Client Drift in Heterogeneous Federated Learning

Authors: M Yashwanth, Gaurav Kumar Nayak, Arya Singh, Yogesh Simmhan, Anirban Chakraborty

TMLR 2024

Reproducibility Variable | Result | LLM Response

Research Type | Experimental | "We demonstrate the efficacy of our approach through extensive experiments on multiple real-world benchmarks and show substantial gains in performance when the proposed regularizer is combined with popular FL methods." The link to the code is https://github.com/vcl-iisc/fed-adaptive-self-distillation.

Researcher Affiliation | Academia | M. Yashwanth (Indian Institute of Science), Gaurav Kumar Nayak (Indian Institute of Technology (IIT) Roorkee), Arya Singh (Indian Institute of Science), Yogesh Simmhan (Indian Institute of Science), Anirban Chakraborty (Indian Institute of Science).

Pseudocode | No | The paper describes the proposed method through mathematical equations and textual explanations, for example: "We now describe the proposed method where each client k minimizes f_k(w) as defined below: f_k(w) = L_k(w) + λ L_k^ASD(w) (Eq. 2)." There are no clearly labeled pseudocode or algorithm blocks.

Open Source Code | Yes | The link to the code is https://github.com/vcl-iisc/fed-adaptive-self-distillation.

Open Datasets | Yes | "We perform the experiments on CIFAR-10, CIFAR-100 (Krizhevsky & Hinton, 2009), and Tiny-ImageNet (Le & Yang, 2015) datasets with different degrees of heterogeneity in the balanced settings."

Dataset Splits | Yes | "For generating non-iid data, the Dirichlet distribution is used. To simulate the effect of label imbalance, for every client we sample the probability distribution over the classes from the aforementioned Dirichlet distribution, p_k^dir = Dir(δ, C). ... By configuring the concentration parameter δ to 0.6 and 0.3, we sample the data using the Dirichlet distribution across the labels for each client from moderate to high heterogeneity by controlling δ. We set the total number of clients to 100 in all our experiments. We set the client participation rate to 0.1, i.e., 10 percent of clients are sampled on average per communication round."

Hardware Specification | No | No specific hardware details (GPU/CPU models, processors, memory) used for running the experiments are mentioned in the paper. The paper refers to "edge devices" in general terms but does not specify the hardware used in its experimental setup.

Software Dependencies | No | No version numbers for software libraries or dependencies are provided. The paper mentions "We build our experiments using publicly available codebase by (Acar et al., 2021)" and "We use PyTorch style representation," but does not specify versions for these or other software.

Experiment Setup | Yes | "We set the total number of clients to 100 in all our experiments. We set the client participation rate to 0.1, i.e., 10 percent of clients are sampled on average per communication round. ... Hyperparameters: the SGD algorithm with a learning rate of 0.1, decayed by 0.998 per round, is used to train the client models. Temperature τ is set to 2.0. We only tune the hyper-parameter λ. More hyperparameter setting details and the impact of λ, τ are provided in Sec. A.3 and A.6 of the appendix, respectively. A batch size (B) of 50 is employed for all the experiments unless specified."
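The quoted objective f_k(w) = L_k(w) + λ L_k^ASD(w) can be sketched concretely. The fragment below is a minimal illustration, assuming the self-distillation term is a temperature-scaled KL divergence that pulls the local model's softened predictions toward the global (server) model's; the paper's regularizer is *adaptively* weighted, so the fixed `lam` here is a simplification, not the authors' exact formulation.

```python
import numpy as np

def softmax(logits, tau=1.0):
    """Temperature-scaled softmax along the class axis."""
    z = logits / tau
    z = z - z.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def asd_client_loss(local_logits, global_logits, targets, lam=0.1, tau=2.0):
    """Sketch of the client objective f_k(w) = L_k(w) + lam * L_k^ASD(w).

    L_k is the usual cross-entropy on the client's batch. L_k^ASD is
    assumed here to be a temperature-scaled KL divergence distilling the
    global model's softened predictions into the local model (tau = 2.0
    matches the reported temperature; the fixed lam is an assumption).
    """
    n = len(targets)
    p_local = softmax(local_logits)
    ce = -np.mean(np.log(p_local[np.arange(n), targets] + 1e-12))  # L_k(w)
    p_g = softmax(global_logits, tau)  # teacher: global model, softened
    p_l = softmax(local_logits, tau)   # student: local model, softened
    kl = np.mean(np.sum(p_g * (np.log(p_g + 1e-12) - np.log(p_l + 1e-12)), axis=1))
    return ce + lam * (tau ** 2) * kl  # tau^2 rescaling, as in standard distillation
```

When the local and global predictions coincide, the KL term vanishes and the objective reduces to plain cross-entropy, which is what makes the regularizer a drift penalty rather than a change to the task loss.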
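The Dirichlet split described under "Dataset Splits" (p_k^dir = Dir(δ, C), with δ = 0.6 for moderate and δ = 0.3 for high heterogeneity) can be realized as a hard partition of sample indices. The sketch below uses the common per-class variant, splitting each class's samples across clients with Dirichlet-distributed proportions; this realization is an assumption, not taken from the authors' code.

```python
import numpy as np

def dirichlet_partition(labels, num_clients=100, delta=0.3, seed=0):
    """Partition sample indices across clients with Dirichlet label imbalance.

    For each class, draws Dirichlet(delta) proportions over clients and
    assigns that class's samples accordingly. Smaller delta concentrates
    each class on fewer clients (delta = 0.3 ~ high heterogeneity,
    delta = 0.6 ~ moderate, per the paper's reported settings).
    """
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    client_indices = [[] for _ in range(num_clients)]
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        rng.shuffle(idx)
        proportions = rng.dirichlet(np.full(num_clients, delta))
        # Convert proportions to split points within this class's samples.
        cuts = (np.cumsum(proportions)[:-1] * len(idx)).astype(int)
        for k, shard in enumerate(np.split(idx, cuts)):
            client_indices[k].extend(shard.tolist())
    return client_indices
```

The result is a disjoint cover of all sample indices, so every sample belongs to exactly one client regardless of δ.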
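The reported experiment setup (100 clients, 0.1 participation rate, SGD with learning rate 0.1 decayed by 0.998 per communication round) maps onto a small per-round helper. Uniform sampling without replacement is an assumption here; the paper only states the *average* participation rate.

```python
import random

def round_config(t, num_clients=100, participation=0.1,
                 lr0=0.1, decay=0.998, seed=None):
    """Select the participating clients and learning rate for round t.

    Mirrors the reported setup: 100 clients, 10 percent sampled per
    round, SGD learning rate 0.1 with multiplicative decay 0.998 per
    round. The uniform without-replacement sampling is an assumption.
    """
    rng = random.Random(seed)
    k = max(1, round(num_clients * participation))
    selected = rng.sample(range(num_clients), k)  # e.g. 10 of 100 clients
    lr = lr0 * decay ** t                         # per-round decayed rate
    return selected, lr
```

For example, `round_config(0)` yields 10 distinct client ids and a learning rate of 0.1, which shrinks to about 0.082 by round 100.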