Generalization in Federated Learning: A Conditional Mutual Information Framework

Authors: Ziqiao Wang, Cheng Long, Yongyi Mao

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirical evaluations confirm that our evaluated CMI bounds are non-vacuous and accurately capture the generalization behavior of FL algorithms. To verify our results, we conduct FL experiments using FedAvg (McMahan et al., 2017). We conduct image classification experiments on two datasets: MNIST (LeCun et al., 2010) and CIFAR-10 (Krizhevsky, 2009).
Researcher Affiliation | Academia | 1 School of Computer Science and Technology, Tongji University, Shanghai, China; 2 Department of Applied Physics and Applied Mathematics, Columbia University, New York, USA; 3 School of Electrical Engineering and Computer Science, University of Ottawa, Ottawa, Canada.
Pseudocode | No | The paper describes the methodology mathematically and textually but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | No | We adapt the code from https://github.com/hrayrhar/f-CMI for supersample construction and CMI computation, and we use the FL training code from https://github.com/vaseline555/Federated-Learning-in-PyTorch.
Open Datasets | Yes | We conduct image classification experiments on two datasets: MNIST (LeCun et al., 2010) and CIFAR-10 (Krizhevsky, 2009).
Dataset Splits | Yes | Additionally, we apply a pathological non-IID data partitioning scheme as in McMahan et al. (2017): data are sorted by label, split into 200 shards of size 300, and each client is randomly assigned 2 shards, and we evaluate prediction error as our performance metric... When analyzing generalization behavior concerning the sample size n, we fix the superclient size at 100, leading to 50 participating clients and 50 non-participating clients randomly selected by V. The sample size per client varies within n ∈ {10, 50, 100, 250}. When analyzing generalization behavior with respect to the number of participating clients K, we set n = 100 for MNIST and n = 50 for CIFAR-10, varying the number of clients as K ∈ {10, 20, 30, 50}.
Hardware Specification | Yes | All experiments are performed using NVIDIA A100 GPUs with 40 GB of memory.
Software Dependencies | No | We adapt the code from https://github.com/hrayrhar/f-CMI for supersample construction and CMI computation, and we use the FL training code from https://github.com/vaseline555/Federated-Learning-in-PyTorch. No specific version numbers for libraries or frameworks are provided.
Experiment Setup | Yes | Each local training algorithm Ai trains this model using full-batch GD with an initial learning rate of 0.1, which decays by a factor of 0.01 every 10 steps. At each FL round, clients train locally for 5 epochs before sending their models to the central server. The entire training process spans 300 communication rounds between clients and the central server... Each local training algorithm Ai trains the CNN model using SGD with a mini-batch size of 50 and follows the same learning rate schedule as in the MNIST experiment. As in the MNIST setup, clients train locally for five epochs per round before sending their models to the central server, with training spanning 300 communication rounds.
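The report quotes the paper's use of the f-CMI code for supersample construction. As a rough illustration of what that step involves, the sketch below builds a generic supersample: n disjoint index pairs plus a Bernoulli(1/2) membership vector that picks, per pair, which element is trained on and which is held out. The function name and signature are hypothetical, not taken from the f-CMI repository.

```python
import numpy as np

def build_supersample(dataset_size, n, seed=0):
    """Draw n disjoint index pairs and a Bernoulli(1/2) membership vector.

    For pair i, bit u_i selects the element used for training; the other
    element is held out for evaluating the CMI bound. Hypothetical sketch
    of a supersample construction, not the authors' exact code.
    """
    rng = np.random.default_rng(seed)
    idx = rng.choice(dataset_size, size=(n, 2), replace=False)  # n disjoint pairs
    u = rng.integers(0, 2, size=n)            # membership bits U_i
    train_idx = idx[np.arange(n), u]          # indices selected for training
    heldout_idx = idx[np.arange(n), 1 - u]    # held-out ("ghost") indices
    return train_idx, heldout_idx, u
```

The training and held-out index sets are disjoint by construction, which is what makes the paired supersample usable for estimating evaluated-CMI quantities.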
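The pathological non-IID split quoted in the Dataset Splits row (sort by label, cut into 200 shards of size 300, assign 2 shards per client) can be sketched as follows. This is a generic reimplementation of the McMahan et al. (2017) scheme under those stated numbers; the function name and randomization details are assumptions, not the paper's code.

```python
import numpy as np

def pathological_partition(labels, num_shards=200, shard_size=300,
                           shards_per_client=2, seed=0):
    """Sort indices by label, cut into shards, assign shards to clients.

    Sketch of the pathological non-IID split of McMahan et al. (2017)
    with the shard sizes quoted in the report; details are assumed.
    """
    rng = np.random.default_rng(seed)
    order = np.argsort(labels, kind="stable")  # indices sorted by label
    shards = order[: num_shards * shard_size].reshape(num_shards, shard_size)
    perm = rng.permutation(num_shards)         # random shard-to-client assignment
    num_clients = num_shards // shards_per_client
    return {
        c: np.concatenate(
            shards[perm[c * shards_per_client:(c + 1) * shards_per_client]]
        )
        for c in range(num_clients)
    }
```

With 200 shards and 2 shards per client this yields 100 clients of 600 samples each, and because shards are cut from label-sorted data, each client typically sees at most two classes.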
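The Experiment Setup row quotes a step-decay learning rate and FedAvg aggregation across 300 rounds. The sketch below shows the two pieces in isolation: a step-decay schedule (reading "decays by a factor of 0.01 every 10 steps" as multiplying the rate by 0.01, an interpretation on our part) and the FedAvg server step that averages client parameters weighted by local sample counts. Both functions are illustrative, not the authors' training code.

```python
import numpy as np

def lr_at_step(step, lr0=0.1, gamma=0.01, every=10):
    """Step-decay schedule: multiply the rate by `gamma` every `every` steps.

    Assumes "decays by a factor of 0.01" means multiplication by 0.01
    (equivalent to a StepLR-style schedule); this reading is an assumption.
    """
    return lr0 * gamma ** (step // every)

def fedavg_aggregate(client_states, client_sizes):
    """FedAvg server step: sample-size-weighted average of client parameters.

    `client_states` is a list of dicts mapping parameter names to arrays.
    """
    total = sum(client_sizes)
    return {
        k: sum(s[k] * (m / total) for s, m in zip(client_states, client_sizes))
        for k in client_states[0]
    }
```

In a full run, each client would take its local GD/SGD steps under the schedule, send its state to the server, and the server would call the aggregation once per communication round, 300 times in total.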