Achieving Dimension-Free Communication in Federated Learning via Zeroth-Order Optimization
Authors: Zhe Li, Bicheng Ying, Zidong Liu, Chaosheng Dong, Haibo Yang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical evaluations, encompassing both classic deep learning training and large language model fine-tuning, demonstrate significant reductions in communication overhead. Notably, DeComFL achieves this by transmitting only around 1MB of data in total between the server and a client to fine-tune a model with billions of parameters. [...] Comprehensive experiments on both training and fine-tuning tasks demonstrate that DeComFL achieves comparable performance to existing algorithms while significantly reducing communication costs by several orders of magnitude. |
| Researcher Affiliation | Collaboration | Zhe Li¹, Bicheng Ying², Zidong Liu³, Chaosheng Dong⁴, Haibo Yang¹; ¹Rochester Institute of Technology, Rochester, NY 14623, USA; ²Google Inc., Los Angeles, CA 90034, USA; ³ComboCurve Inc., Houston, TX 77005, USA; ⁴Amazon.com Inc., Seattle, WA 98109, USA |
| Pseudocode | Yes | Algorithm 1 Dimension-Free Communication in Federated Learning (DeComFL) [Server-side]; Algorithm 2 Dimension-Free Communication in Federated Learning (DeComFL) [Client-side]; Algorithm 3 DeComFL (P > 1) [Server-side]; Algorithm 4 DeComFL (P > 1) [Client-side]; Algorithm 5 DeComFL with Gradient Projection [Client-side] |
| Open Source Code | Yes | The code is available at https://github.com/ZidongLiu/DeComFL. |
| Open Datasets | Yes | We begin by training a simple Convolutional Neural Network model from scratch on the MNIST image classification task (LeCun et al., 1998). [...] We further evaluate this perturbation trick on Fashion-MNIST (Xiao et al., 2017) with a larger CNN model [...] We utilize a series of Natural Language Processing (NLP) datasets to execute fine-tuning tasks on LLMs (e.g., OPT-125M and OPT-1.3B), such as SST-2 (Socher et al., 2013; Wang et al., 2018) for the sentiment classification task, CB (de Marneffe et al., 2019) for the hypothesis inference problem, WSC (Kocijan et al., 2020) for the commonsense reasoning task, WIC (Pilehvar & Camacho-Collados, 2018) for the word sense disambiguation task, RTE (Bowman et al., 2015) for the natural language inference task, and BoolQ (Clark et al., 2019) for question answering. |
| Dataset Splits | Yes | In the training tasks, our FL system comprises 100 clients, and we partition the dataset into 100 subsets by Dirichlet distribution (i.e., α = 1). Each subset is assigned to one sampled client. In each communication round, 10% of clients are randomly selected to participate in the training process. [...] Loading and splitting datasets are based on https://huggingface.co/datasets/super_glue. |
| Hardware Specification | No | No specific hardware (GPU models, CPU models, or memory) used for running the experiments is mentioned in the paper. |
| Software Dependencies | No | No specific software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow, CUDA) are mentioned in the paper. |
| Experiment Setup | Yes | All algorithms use the SGD optimizer with momentum 0.5. For DeComFL, we set the base learning rate as 0.001. [...] we use 25 perturbations at the beginning and double it at rounds 500, 1000, and 2000. [...] The quantization used in FedCom is compressing each element in the parameters to 8 bits. [...] For the experiments on LLMs, there are eight clients in total, and in each communication round, only two clients are sampled to participate in the training. [...] In Table 5, we show the specific hyper-parameter settings about learning rate and total communication rounds. For other shared parameters, we set the smoothing parameter µ = 1e-3, Dirichlet concentration parameter α = 1, and local update step K = 1. For DeComFL's experiments, we set train batch size = 32 and test batch size = 64. |
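The dimension-free claim summarized above rests on zeroth-order optimization with shared randomness: server and clients exchange only random seeds and scalar two-point gradient estimates, and each side regenerates the perturbation vector locally, so no parameter-sized message is ever sent. The following is a minimal sketch of that mechanism only; the function names and the quadratic usage below are illustrative assumptions, not DeComFL's actual implementation (which is in the linked repository):

```python
import numpy as np

def zo_grad_scalar(loss_fn, params, seed, mu=1e-3):
    """Two-point zeroth-order gradient estimate along a seeded direction.

    Only this scalar (plus the shared seed) needs to be communicated;
    the perturbation z is regenerated locally from the seed.
    The smoothing parameter mu = 1e-3 follows the reported setup.
    """
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(params.shape)
    return (loss_fn(params + mu * z) - loss_fn(params - mu * z)) / (2 * mu)

def apply_update(params, seed, grad_scalar, lr=0.001):
    """Reconstruct the model update from a (seed, scalar) pair."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(params.shape)
    return params - lr * grad_scalar * z
```

For a quadratic loss the estimate is exact along the sampled direction, which makes the seed-plus-scalar reconstruction easy to verify on toy problems.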
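The non-IID split described in the Dataset Splits row (100 clients, Dirichlet concentration α = 1) can be sketched as below. The per-class assignment scheme and function name are my assumptions about a standard Dirichlet partition, not the paper's exact code:

```python
import numpy as np

def dirichlet_partition(labels, num_clients=100, alpha=1.0, seed=0):
    """Split sample indices into non-IID shards, one per client.

    For each class, draw client proportions from Dirichlet(alpha) and
    slice that class's samples accordingly (alpha = 1 per the paper).
    """
    rng = np.random.default_rng(seed)
    shards = [[] for _ in range(num_clients)]
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        rng.shuffle(idx)
        props = rng.dirichlet(alpha * np.ones(num_clients))
        cuts = (np.cumsum(props)[:-1] * len(idx)).astype(int)
        for shard, part in zip(shards, np.split(idx, cuts)):
            shard.extend(part.tolist())
    return shards
```

Every sample lands in exactly one shard; smaller α would make the per-client class mixtures more skewed, while α = 1 gives moderate heterogeneity.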