Provably Near-Optimal Federated Ensemble Distillation with Negligible Overhead

Authors: Won-Jun Jang, Hyeon-Seo Park, Si-Hyeon Lee

ICML 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Our experiments on various image classification tasks demonstrate that the proposed method significantly outperforms baselines. Furthermore, we show that the additional communication cost, client-side privacy leakage, and client-side computational overhead introduced by our method are negligible, both in scenarios with and without a pre-existing server dataset.
Researcher Affiliation Academia School of Electrical Engineering, Korea Advanced Institute of Science and Technology (KAIST), Daejeon, South Korea. Correspondence to: Si-Hyeon Lee <EMAIL>.
Pseudocode Yes Algorithm 1 Federated learning with K clients for T communication rounds, with ensemble distillation exploiting unlabeled dataset on the server. Algorithm 2 FedGO algorithm with K clients for T communication rounds. Algorithm 3 Discriminator update for Ed epochs.
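The ensemble-distillation step named in Algorithm 1 (the server distills from client predictions on its unlabeled dataset) can be sketched as follows. This is a minimal numpy illustration, not the authors' implementation; the function names and the simple averaging of client predictive distributions are assumptions for exposition.

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax over the class axis.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def ensemble_distillation_targets(client_logits):
    """Average the K clients' predictive distributions on the unlabeled
    server data to form soft labels for server-side distillation.

    client_logits: list of K arrays of shape (N, C), one per client.
    Returns an (N, C) array of soft targets (rows sum to 1).
    """
    probs = softmax(np.stack(client_logits), axis=-1)  # (K, N, C)
    return probs.mean(axis=0)                          # (N, C)
```

The server would then minimize a KL-divergence (or cross-entropy) loss between its own predictions and these soft targets; how the client outputs are weighted is the part that varies between plain ensemble distillation and the paper's FedGO variant.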
Open Source Code Yes For ease of reproduction, our code is open-sourced (https://github.com/pupiu45/FedGO).
Open Datasets Yes We employed datasets CIFAR-10/100 (Krizhevsky, 2009) (MIT license) and downsampled ImageNet100 (ImageNet100 dataset; Chrabaszcz et al., 2017).
Dataset Splits Yes Unless specified otherwise, the entire client dataset corresponds to half of the specified client dataset (half for each class), and each client dataset is sampled from the entire client dataset according to Dirichlet(α), akin to setups in Lin et al. (2020); Cho et al. (2022). α is set to 0.1 and 0.05 to represent data-heterogeneous scenarios. The server dataset corresponds to half of the specified server dataset (half for each class) without labels. ... Table 3. Server test accuracy (%) of our Fed GO and baselines on three image datasets at the 100-th communication round.
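The Dirichlet(α) client partitioning described above (with α = 0.1 or 0.05 for data heterogeneity) is a standard recipe: for each class, draw per-client proportions from Dirichlet(α) and split that class's samples accordingly. A minimal sketch, assuming label arrays and illustrative function names (not the authors' code):

```python
import numpy as np

def dirichlet_partition(labels, num_clients, alpha, seed=0):
    """Split sample indices across clients with per-class Dirichlet(alpha)
    proportions. Smaller alpha gives more heterogeneous client datasets."""
    rng = np.random.default_rng(seed)
    client_indices = [[] for _ in range(num_clients)]
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        rng.shuffle(idx)
        # Share of this class that each client receives.
        proportions = rng.dirichlet(alpha * np.ones(num_clients))
        cuts = (np.cumsum(proportions)[:-1] * len(idx)).astype(int)
        for k, part in enumerate(np.split(idx, cuts)):
            client_indices[k].extend(part.tolist())
    return client_indices
```

With α = 0.1 most clients end up holding samples from only a few classes, which is the heterogeneous regime the experiments target.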
Hardware Specification Yes All experiments were conducted in Python 3.8.12 environment using a 64-core Intel 2.90GHz Xeon Gold 6226R CPU with 512GB memory, and an RTX 3090 GPU.
Software Dependencies Yes All experiments were conducted in Python 3.8.12 environment using a 64-core Intel 2.90GHz Xeon Gold 6226R CPU with 512GB memory, and an RTX 3090 GPU. We also implemented the algorithms using PyTorch with version 1.11.0.
Experiment Setup Yes During the ensemble distillation process, we trained both clients and server with the Adam optimizer (Kingma & Ba, 2015) at a learning rate of 0.001 with batch size 64, without weight decay. The (β1, β2) parameters for Adam were set to (0.9, 0.999). Additionally, we applied cosine annealing (Loshchilov & Hutter, 2022) to decay the server learning rate until the final communication round T = 100 as in Lin et al. (2020), except for the results of F.3 and F.5. For the client and server classifier training epochs, we performed a grid search to find the optimal number of training epochs. The initial grid was {5, 10, 30, 50}, and the experiments were conducted with 30 client epochs and 10 server epochs (Es = 10) for CIFAR-10/100. To leverage the increased number of steps due to the additional number of data, experiments on ImageNet100 were conducted with 10 client classifier epochs and 3 server classifier epochs (Es = 3).
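The cosine-annealed server learning rate described above (base rate 0.001, decayed over T = 100 communication rounds) follows the standard schedule; a minimal sketch of the formula, with illustrative function and parameter names (in PyTorch this is typically `torch.optim.lr_scheduler.CosineAnnealingLR`):

```python
import math

def cosine_annealed_lr(round_t, total_rounds=100, base_lr=1e-3, min_lr=0.0):
    """Cosine-annealed learning rate at communication round round_t:
    starts at base_lr, decays smoothly to min_lr at total_rounds."""
    return min_lr + 0.5 * (base_lr - min_lr) * (
        1 + math.cos(math.pi * round_t / total_rounds)
    )
```

At round 0 this returns the base rate 0.001, at round 50 half of it, and at round 100 the minimum rate, matching the decay-to-the-final-round setup quoted above.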