Adaptive Client Sampling in Federated Learning via Online Learning with Bandit Feedback

Authors: Boxin Zhao, Lingxiao Wang, Ziqi Liu, Zhiqiang Zhang, Jun Zhou, Chaochao Chen, Mladen Kolar

JMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type: Experimental. "Through both simulated and real data experiments, we empirically illustrate the advantages of the proposed client sampling algorithm over uniform sampling and existing online-learning-based sampling strategies. The proposed adaptive sampling procedure is applicable beyond the FL problem studied here and can be used to improve the performance of stochastic optimization procedures such as stochastic gradient descent and stochastic coordinate descent."
Researcher Affiliation: Collaboration. Boxin Zhao (EMAIL), University of Chicago Booth School of Business, Chicago, IL 60637, USA; Lingxiao Wang (EMAIL), Department of Data Science, New Jersey Institute of Technology, Newark, NJ 07106, USA; Ziqi Liu (EMAIL), Zhiqiang Zhang (EMAIL), and Jun Zhou (EMAIL), Ant Financial Group, Hangzhou, Zhejiang, China; Chaochao Chen (EMAIL), College of Computer Science and Technology, Zhejiang University, Hangzhou, Zhejiang, China; Mladen Kolar (EMAIL), Marshall School of Business, University of Southern California, Los Angeles, CA 90089, USA.
Pseudocode: Yes. Algorithm 1: OSMD Sampler; Algorithm 2: Solver for Step 7 of Algorithm 1; Algorithm 3: Mini-batch SGD with OSMD Sampler; Algorithm 4: FedAvg with OSMD Sampler; Algorithm 5: Adaptive-OSMD Sampler; Algorithm 6: Adaptive-OSMD Sampler with Doubling Trick (Adaptive-Doubling-OSMD); Algorithm 7: Adaptive sampling without replacement.
Open Source Code: Yes. "Code to replicate the results in this paper is available at https://github.com/boxinz17/FL-Client-Sampling."
Open Datasets: Yes. "We use three commonly used computer vision data sets: MNIST (LeCun and Cortes, 2010), KMNIST (Clanuwat et al., 2018), and FMNIST (Xiao et al., 2017)."
Dataset Splits: Yes. "We set the number of devices to be M = 500. To better simulate the situation where our method brings significant convergence speed improvement, we create a highly skewed sample size distribution of the training set among clients: 65% of clients have only one training sample, 20% of clients have 5 training samples, 10% of clients have 30 training samples, and 5% of clients have 100 training samples. This setting tries to illustrate a real-life situation where most of the data come from a small fraction of users, while most of the users have only a small number of samples. ... In addition, each client has 10 validation samples used to measure the prediction accuracy of the model over the training process."
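The skewed split quoted above is straightforward to generate. A minimal sketch, assuming only the stated fractions and sample sizes (the function name `skewed_client_sizes` and the seeding are illustrative, not from the paper):

```python
import random

# Assign per-client training-set sizes following the paper's skewed split:
# 65% of clients get 1 sample, 20% get 5, 10% get 30, and 5% get 100.
def skewed_client_sizes(num_clients=500, seed=0):
    fractions = [(0.65, 1), (0.20, 5), (0.10, 30), (0.05, 100)]
    sizes = []
    for frac, n_samples in fractions:
        sizes += [n_samples] * round(frac * num_clients)
    # Guard against rounding drift, then shuffle the assignment to clients.
    sizes = (sizes + [1] * num_clients)[:num_clients]
    random.Random(seed).shuffle(sizes)
    return sizes

sizes = skewed_client_sizes()  # 500 entries, e.g. 325 clients with 1 sample
```

With M = 500 the fractions divide evenly (325 / 100 / 50 / 25 clients), so no rounding correction is actually triggered.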
Hardware Specification: No. The paper mentions the "University of Chicago Booth Mercury Computing Cluster" in the acknowledgments but gives no hardware details (e.g., GPU/CPU models or memory) for the experiments.
Software Dependencies: No. The paper does not name any software components with version numbers (e.g., Python, PyTorch, or TensorFlow versions) used to implement the methodology or run the experiments.
Experiment Setup: Yes. "We set the number of clients as M = 100, and each client has n_m = 100 samples... The dimension of the problem is set as d = 10... We use stochastic gradient descent to make global updates. At each round t, we choose a subset of K = 5 clients... For each client m ∈ S_t, we choose a mini-batch of samples, B_m^t, of size B = 10... The parameter w is updated as w^{t+1} = w^t + μ_SGD Σ_{i ∈ B_m^t} (y_{m,i} − ⟨w^t, x_{m,i}⟩) x_{m,i}, where μ_SGD is the learning rate, set as μ_SGD = 0.1 in simulations. In all experiments, we set α in the Adaptive-OSMD Sampler as α = 0.4. The tuning parameters for MABS, VRB, and Avare are set as in their original papers. ... The learning rate in SGD is set to 0.075 for MNIST and KMNIST, and to 0.03 for FMNIST. The total number of communication rounds is set to 1,000. In each round of communication, we choose K = 10 clients to participate (2% of the total number of clients). For a chosen client m, we compute its local mini-batch gradient with batch size min{5, n_m}, where n_m is the training sample size on client m."
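The simulated setup above can be sketched in a few lines: sample K clients from a sampling distribution, then apply the least-squares update w ← w + μ_SGD Σ_{i∈B_m} (y_i − ⟨w, x_i⟩) x_i for each sampled client. This is a minimal stdlib-only sketch of the generic sampled-SGD round, not the paper's OSMD Sampler itself; the function name `sgd_round` and the data layout (each client as an `(xs, ys)` pair) are assumptions for illustration:

```python
import random

# One communication round of mini-batch SGD with probabilistic client sampling
# for the least-squares simulation. `probs` is the client sampling distribution
# (uniform or adaptively learned); mu=0.1 matches the paper's simulation setting.
def sgd_round(w, clients, probs, K=5, B=10, mu=0.1, rng=random):
    # Draw K clients (with replacement) according to `probs`.
    sampled = rng.choices(range(len(clients)), weights=probs, k=K)
    for m in sampled:
        xs, ys = clients[m]
        # Mini-batch of size min(B, n_m), as in the experiment setup.
        batch = rng.sample(range(len(xs)), min(B, len(xs)))
        # Accumulate the residual-weighted gradient over the mini-batch,
        # then take one step: w <- w + mu * sum_i (y_i - <w, x_i>) x_i.
        grad = [0.0] * len(w)
        for i in batch:
            resid = ys[i] - sum(wj * xj for wj, xj in zip(w, xs[i]))
            grad = [g + resid * xj for g, xj in zip(grad, xs[i])]
        w = [wj + mu * g for wj, g in zip(w, grad)]
    return w
```

Swapping the `weights` passed to `rng.choices` is the only change needed to move from uniform sampling to an adaptive sampling distribution such as the one the OSMD Sampler maintains.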