Achieving Dimension-Free Communication in Federated Learning via Zeroth-Order Optimization

Authors: Zhe Li, Bicheng Ying, Zidong Liu, Chaosheng Dong, Haibo Yang

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirical evaluations, encompassing both classic deep learning training and large language model fine-tuning, demonstrate significant reductions in communication overhead. Notably, DeComFL achieves this by transmitting only around 1MB of data in total between the server and a client to fine-tune a model with billions of parameters. [...] Comprehensive experiments on both training and fine-tuning tasks demonstrate that DeComFL achieves comparable performance to existing algorithms while significantly reducing communication costs by several orders of magnitude.
Researcher Affiliation | Collaboration | Zhe Li1, Bicheng Ying2, Zidong Liu3, Chaosheng Dong4, Haibo Yang1 — 1Rochester Institute of Technology, Rochester, NY 14623, USA; 2Google Inc., Los Angeles, CA 90034, USA; 3ComboCurve Inc., Houston, TX 77005, USA; 4Amazon.com Inc., Seattle, WA 98109, USA
Pseudocode | Yes | Algorithm 1 Dimension-Free Communication in Federated Learning (DeComFL) [Server-side]; Algorithm 2 Dimension-Free Communication in Federated Learning (DeComFL) [Client-side]; Algorithm 3 DeComFL (P > 1) [Server-side]; Algorithm 4 DeComFL (P > 1) [Client-side]; Algorithm 5 DeComFL with Gradient Projection [Client-side]
Open Source Code | Yes | The code is available at https://github.com/ZidongLiu/DeComFL.
Open Datasets | Yes | We begin by training a simple Convolutional Neural Network model from scratch on the MNIST image classification task (LeCun et al., 1998). [...] We further evaluate this perturbation trick on Fashion-MNIST (Xiao et al., 2017) with a larger CNN model [...] We utilize a series of Natural Language Processing (NLP) datasets to execute fine-tuning tasks on LLMs (e.g., OPT-125M and OPT-1.3B), such as SST-2 (Socher et al., 2013; Wang et al., 2018) for the sentiment classification task, CB (de Marneffe et al., 2019) for the hypothesis inference problem, WSC (Kocijan et al., 2020) for the commonsense reasoning task, WIC (Pilehvar & Camacho-Collados, 2018) for the word sense disambiguation task, RTE (Bowman et al., 2015) for the natural language inference task, and BoolQ (Clark et al., 2019) for question answering.
Dataset Splits | Yes | In the training tasks, our FL system comprises 100 clients, and we partition the dataset into 100 subsets by Dirichlet distribution (i.e., α = 1). Each subset is assigned to one sampled client. In each communication round, 10% of clients are randomly selected to participate in the training process. [...] Loading and splitting datasets are based on https://huggingface.co/datasets/super_glue.
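The Dirichlet-based non-IID split described above can be sketched as follows. This is a minimal illustration of the standard technique, not the paper's code; the function name `dirichlet_partition` and the class-wise splitting strategy are assumptions.

```python
import numpy as np

def dirichlet_partition(labels, num_clients=100, alpha=1.0, seed=0):
    """Partition sample indices into `num_clients` subsets, drawing each
    class's per-client proportions from a Dirichlet(alpha) distribution.
    Smaller alpha -> more heterogeneous (non-IID) client datasets."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    client_indices = [[] for _ in range(num_clients)]
    for c in np.unique(labels):
        idx = rng.permutation(np.where(labels == c)[0])
        # Per-client share of this class, then split the shuffled indices.
        proportions = rng.dirichlet(alpha * np.ones(num_clients))
        cuts = (np.cumsum(proportions)[:-1] * len(idx)).astype(int)
        for client, part in enumerate(np.split(idx, cuts)):
            client_indices[client].extend(part.tolist())
    return [np.array(ix) for ix in client_indices]
```

With α = 1 (as in the paper's setting), client label distributions are moderately skewed; every sample is assigned to exactly one client.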
Hardware Specification | No | No specific hardware (GPU models, CPU models, or memory) used for running the experiments is mentioned in the paper.
Software Dependencies | No | No specific software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow, CUDA) are mentioned in the paper.
Experiment Setup | Yes | All algorithms use the SGD optimizer with momentum 0.5. For DeComFL, we set the base learning rate as 0.001. [...] we use 25 perturbations at the beginning and double it at rounds 500, 1000, and 2000. [...] The quantization used in FedCom compresses each element of the parameters to 8 bits. [...] For the experiments on LLMs, there are eight clients in total, and in each communication round, only two clients are sampled to participate in the training. [...] In Table 5, we show the specific hyper-parameter settings for the learning rate and total communication rounds. For other shared parameters, we set the smoothing parameter µ = 1e-3, Dirichlet concentration parameter α = 1, and local update step K = 1. For DeComFL's experiments, we set train batch size = 32 and test batch size = 64.
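The dimension-free communication claim rests on a two-point zeroth-order gradient estimate: because client and server can regenerate the same random perturbation from a shared seed, only the seed and a scalar projected gradient cross the network, not the d-dimensional model. The sketch below illustrates this mechanism under stated assumptions; the function names `zo_gradient_scalar` / `zo_update` and the Gaussian perturbation choice are illustrative, not taken from the paper's code.

```python
import numpy as np

def zo_gradient_scalar(loss_fn, params, seed, mu=1e-3):
    """Two-point zeroth-order estimate along one random direction.
    Returns only a scalar; together with `seed` this fully describes
    the update, so communication cost is independent of model size."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(params.shape)  # shared perturbation direction
    return (loss_fn(params + mu * z) - loss_fn(params - mu * z)) / (2 * mu)

def zo_update(params, seed, g, lr=1e-3):
    """Any party holding (seed, g) regenerates the same direction z
    and applies the identical model update locally."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(params.shape)
    return params - lr * g * z
```

Using P perturbations per round (25, doubled on schedule, per the setup above) would repeat this with P distinct seeds and average the resulting scalar gradients.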