Buffer-based Gradient Projection for Continual Federated Learning

Authors: Shenghong Dai, Jy-yong Sohn, Yicong Chen, S M Iftekharul Alam, Ravikumar Balakrishnan, Suman Banerjee, Nageen Himayat, Kangwook Lee

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments on standard benchmarks show consistent performance improvements across diverse scenarios. For example, in a task-incremental learning scenario using the CIFAR-100 dataset, our method can increase the accuracy by up to 27%. Our code is available at https://github.com/shenghongdai/Fed-A-GEM.
Researcher Affiliation | Collaboration | Shenghong Dai (Department of Electrical and Computer Engineering, University of Wisconsin-Madison); Jy-yong Sohn (Department of Applied Statistics, Yonsei University); Yicong Chen (Department of Electrical and Computer Engineering, University of Wisconsin-Madison); S M Iftekharul Alam (Intel Labs); Ravikumar Balakrishnan (Intel Labs); Suman Banerjee (Department of Computer Sciences, University of Wisconsin-Madison); Nageen Himayat (Intel Labs); Kangwook Lee (Department of Electrical and Computer Engineering, University of Wisconsin-Madison)
Pseudocode | Yes | Algorithm 1: FedAvg Server Update with Fed-A-GEM
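Fed-A-GEM builds on the A-GEM gradient projection rule: when the current update conflicts with a reference gradient computed on buffered past data, the conflicting component is removed. A minimal sketch of that projection (the function name and NumPy formulation are illustrative, not the authors' code):

```python
import numpy as np

def agem_project(grad, buffer_grad):
    """A-GEM-style projection: if the current gradient conflicts with
    the buffer gradient (negative dot product), subtract the conflicting
    component so the update does not increase loss on buffered data;
    otherwise return the gradient unchanged."""
    dot = np.dot(grad, buffer_grad)
    if dot < 0:
        return grad - (dot / np.dot(buffer_grad, buffer_grad)) * buffer_grad
    return grad
```

After projection, the returned gradient is guaranteed to have a non-negative inner product with the buffer gradient.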
Open Source Code Yes Our code is available at https://github.com/shenghongdai/Fed-A-GEM.
Open Datasets | Yes | We evaluate our approach on three CL scenarios: domain incremental learning (domain-IL), class incremental learning (class-IL), and task incremental learning (task-IL). We explain these three types of incremental learning (IL) settings with examples in Appendix A. For domain-IL, the data distribution of each class changes across different tasks. We use the rotated-MNIST (Lopez-Paz & Ranzato, 2017) and permuted-MNIST (Goodfellow et al., 2013) datasets for domain-IL, where each task rotates the training digits by a random angle or applies a random permutation. We create T = 10 tasks for domain-IL experiments. For class-IL and task-IL, we use the sequential-CIFAR10 (S-CIFAR10) and sequential-CIFAR100 (S-CIFAR100) datasets, which partition the set of classes into disjoint subsets and treat each subset as a separate task. ... a text classification task (Mehta et al., 2023) on sequential-Yahoo QA dataset (Zhang et al., 2015).
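The sequential datasets partition the class set into disjoint subsets, one per task (e.g., S-CIFAR100 splits C = 100 classes into T = 10 tasks of 10 classes each). A hypothetical helper illustrating that split:

```python
def class_splits(num_classes=100, num_tasks=10):
    """Disjoint, contiguous class subsets for sequential task
    construction, e.g. S-CIFAR100: 100 classes -> 10 tasks of 10."""
    per_task = num_classes // num_tasks
    return [list(range(t * per_task, (t + 1) * per_task))
            for t in range(num_tasks)]
```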
Dataset Splits | Yes | For the rotated-MNIST or permuted-MNIST dataset, each client receives samples for two MNIST digits. To create a sequential-CIFAR10 or sequential-CIFAR100 dataset, we partition the dataset among multiple clients using a Dirichlet distribution (Hsu et al., 2019). Specifically, we draw q ∼ Dir(αp), where p represents a prior class distribution over N classes, and α is a concentration parameter that controls the degree of heterogeneity among clients. For our experiments, we use α = 0.3, which provides a moderate level of heterogeneity. ... For class-IL and task-IL, we use the sequential-CIFAR10 (S-CIFAR10) and sequential-CIFAR100 (S-CIFAR100) datasets, which partition the set of classes into disjoint subsets and treat each subset as a separate task. For instance, in our image classification experiments for class-IL and task-IL, we divide the CIFAR-100 dataset (with C = 100 classes) into T = 10 subsets, each of which contains the samples for C/T = 10 classes.
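The Dirichlet split q ∼ Dir(αp) is commonly implemented by drawing, for each class, a vector of client proportions from a symmetric Dirichlet. A sketch under that assumption (function name and index bookkeeping are illustrative, following Hsu et al., 2019, not the authors' code):

```python
import numpy as np

def dirichlet_partition(labels, num_clients=10, alpha=0.3, seed=0):
    """Split sample indices among clients with per-class Dirichlet
    proportions; alpha = 0.3 matches the moderate heterogeneity
    reported in the paper."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    client_indices = [[] for _ in range(num_clients)]
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        rng.shuffle(idx)
        # Draw this class's share for each client from Dir(alpha * 1).
        proportions = rng.dirichlet(alpha * np.ones(num_clients))
        cuts = (np.cumsum(proportions)[:-1] * len(idx)).astype(int)
        for client, part in enumerate(np.split(idx, cuts)):
            client_indices[client].extend(part.tolist())
    return client_indices
```

Smaller α concentrates each class on fewer clients (more heterogeneity); larger α approaches a uniform IID split.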
Hardware Specification | Yes | All experiments were conducted on a Linux workstation equipped with 8 NVIDIA GeForce RTX 2080Ti GPUs and averaged across five runs, each using a different seed.
Software Dependencies | No | The paper mentions various frameworks and models used (e.g., DistilBERT, tiny YOLO, CARLA, OpenFL), but does not provide specific version numbers for core software dependencies like Python, PyTorch, or CUDA.
Experiment Setup | Yes | For the rotated-MNIST and permuted-MNIST datasets, we use a simple CNN architecture (McMahan et al., 2017), and split the dataset into K = 10 clients. Each client performs local training for E = 1 epoch between communications, and we set the number of communication rounds as R = 20 for each task. For the sequential-CIFAR10 and sequential-CIFAR100 datasets, we use a ResNet18 architecture, and divide the dataset into K = 10 clients. Each client trains for E = 5 epochs between communications, and uses R = 20 rounds of communication for each task. During local training, Stochastic Gradient Descent (SGD) is employed with a learning rate of 0.01 for MNIST and 0.1 for CIFAR datasets. Unless otherwise noted, the buffer size is set to B = 200, a negligible storage for edge devices.
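The reported hyperparameters can be collected into a single reference structure (the values are quoted from the setup above; the dictionary layout and keys are illustrative, not an official config file):

```python
# Hyperparameters reported in the paper's experiment setup.
CONFIGS = {
    "rotated_mnist": dict(model="SimpleCNN", clients=10, local_epochs=1,
                          rounds_per_task=20, tasks=10, optimizer="SGD",
                          lr=0.01, buffer_size=200),
    "seq_cifar100": dict(model="ResNet18", clients=10, local_epochs=5,
                         rounds_per_task=20, tasks=10, optimizer="SGD",
                         lr=0.1, buffer_size=200),
}
```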