Adaptive Self-Distillation for Minimizing Client Drift in Heterogeneous Federated Learning
Authors: M Yashwanth, Gaurav Kumar Nayak, Arya Singh, Yogesh Simmhan, Anirban Chakraborty
TMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the efficacy of our approach through extensive experiments on multiple real-world benchmarks and show substantial gains in performance when the proposed regularizer is combined with popular FL methods. The link to the code is https://github.com/vcl-iisc/fed-adaptive-self-distillation. |
| Researcher Affiliation | Academia | M. Yashwanth (Indian Institute of Science), Gaurav Kumar Nayak (Indian Institute of Technology (IIT) Roorkee), Arya Singh (Indian Institute of Science), Yogesh Simmhan (Indian Institute of Science), Anirban Chakraborty (Indian Institute of Science) |
| Pseudocode | No | The paper describes the proposed method through mathematical equations and textual explanations, for example: 'We now describe the proposed method where each client k minimizes f_k(w) as defined below: f_k(w) = L_k(w) + λ L_k^ASD(w) (Eq. 2)'. There are no clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | The link to the code is https://github.com/vcl-iisc/fed-adaptive-self-distillation. |
| Open Datasets | Yes | We perform the experiments on CIFAR-10, CIFAR-100 (Krizhevsky & Hinton, 2009), and Tiny-ImageNet (Le & Yang, 2015) datasets with different degrees of heterogeneity in the balanced settings |
| Dataset Splits | Yes | For generating non-iid data, a Dirichlet distribution is used. To simulate the effect of label imbalance, for every client we sample the probability distribution over the classes from the aforementioned Dirichlet distribution, p_k^dir = Dir(δ, C). ...By setting the concentration parameter δ to 0.6 and 0.3, we sample the data across the labels for each client, controlling δ to move from moderate to high heterogeneity. We set the total number of clients to 100 in all our experiments. We set the client participation rate to 0.1, i.e., 10 percent of clients are sampled on average per communication round |
| Hardware Specification | No | No specific hardware details (GPU/CPU models, processors, memory) used for running the experiments are mentioned in the paper. The paper refers to 'edge devices' in general terms but does not specify the hardware used for their experimental setup. |
| Software Dependencies | No | No specific version numbers for software libraries or dependencies are provided. The paper mentions 'We build our experiments using publicly available codebase by (Acar et al., 2021)' and 'We use PyTorch style representation.' but does not specify versions for these or other software. |
| Experiment Setup | Yes | We set the total number of clients to 100 in all our experiments. We set the client participation rate to 0.1, i.e., 10 percent of clients are sampled on average per communication round... Hyperparameters: the SGD algorithm with a learning rate of 0.1 and a per-round learning-rate decay of 0.998 is used to train the client models. Temperature τ is set to 2.0. We only tune the hyper-parameter λ. More hyperparameter setting details and the impact of λ and τ are provided in Sec. A.3 and A.6 of the appendix, respectively. A batch size (B) of 50 and a learning rate of 0.1 with decay of 0.998 are employed for all experiments unless specified. |
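The client objective quoted above, f_k(w) = L_k(w) + λ L_k^ASD(w) with temperature τ = 2.0, can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: it shows a plain temperature-scaled distillation term added to the task loss, and omits the paper's adaptive per-sample weighting; the function names (`distillation_loss`, `client_objective`) and the use of the global model's logits as the teacher are assumptions for illustration.

```python
import numpy as np

def softmax(logits, tau=1.0):
    """Temperature-scaled softmax, numerically stabilized."""
    z = logits / tau
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, tau=2.0):
    """KL(teacher || student) at temperature tau, averaged over the batch.
    The tau**2 factor keeps gradient magnitudes comparable across temperatures."""
    p_t = softmax(teacher_logits, tau)
    p_s = softmax(student_logits, tau)
    kl = (p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12))).sum(axis=-1)
    return (tau ** 2) * kl.mean()

def client_objective(task_loss, student_logits, teacher_logits, lam=1.0, tau=2.0):
    # f_k(w) = L_k(w) + lambda * L_k^ASD(w)   (Eq. 2 in the paper, simplified)
    return task_loss + lam * distillation_loss(student_logits, teacher_logits, tau)
```

When the local (student) and global (teacher) logits agree, the regularizer vanishes and the objective reduces to the plain client loss; as the local model drifts from the global one, the KL term grows and pulls it back.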
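The Dirichlet-based non-iid split described in the Dataset Splits row (per-client class distribution drawn from Dir(δ, C), with δ ∈ {0.6, 0.3} and 100 clients) can be sketched as below. This is a generic reconstruction of that partitioning scheme, not the authors' code; the function name `dirichlet_partition` and the proportional-splitting details are assumptions.

```python
import numpy as np

def dirichlet_partition(labels, num_clients=100, num_classes=10, delta=0.3, seed=0):
    """Partition sample indices across clients by drawing each client's
    label distribution from Dir(delta); smaller delta means more label skew."""
    rng = np.random.default_rng(seed)
    # p[k, c]: client k's probability mass on class c
    p = rng.dirichlet(np.full(num_classes, delta), size=num_clients)
    client_idx = [[] for _ in range(num_clients)]
    for c in range(num_classes):
        idx_c = np.flatnonzero(labels == c)
        rng.shuffle(idx_c)
        # split class-c samples proportionally to each client's mass on c
        shares = p[:, c] / p[:, c].sum()
        cuts = (np.cumsum(shares)[:-1] * len(idx_c)).astype(int)
        for k, part in enumerate(np.split(idx_c, cuts)):
            client_idx[k].extend(part.tolist())
    return client_idx
```

Every sample is assigned to exactly one client, and lowering `delta` from 0.6 to 0.3 concentrates each client's data on fewer classes, matching the paper's moderate-to-high heterogeneity settings.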