FedPop: Federated Population-based Hyperparameter Tuning

Authors: Haokun Chen, Denis Krompaß, Jindong Gu, Volker Tresp

AAAI 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Our empirical validation on the common FL benchmarks and complex real-world FL datasets, including full-sized Non-IID ImageNet-1K, demonstrates the effectiveness of the proposed method, which substantially outperforms the concurrent state-of-the-art HP-tuning methods in FL. ... We conduct an extensive empirical analysis to investigate the proposed method and its viability. Firstly, we compare FedPop with the SOTA and other baseline methods on three common FL benchmarks following (Khodak et al. 2021). Subsequently, we validate our approach by tuning hyperparameters for complex real-world cross-silo FL settings.
Researcher Affiliation Collaboration 1 Ludwig Maximilian University of Munich, Munich, Germany; 2 Siemens Technology, Munich, Germany; 3 University of Oxford, Oxford, England; 4 Munich Center for Machine Learning, Munich, Germany
Pseudocode Yes The pseudo codes of the proposed method are given in Algorithm 1.
Open Source Code No The paper does not contain any explicit statement about releasing source code, nor does it provide a link to a code repository in the main text or supplementary information.
Open Datasets Yes We conduct experiments on three benchmark datasets on both vision and language tasks: (1) CIFAR-10 (Krizhevsky, Hinton et al. 2009), which is an image classification dataset containing 10 categories of real-world objects. (2) FEMNIST (Caldas et al. 2018), which includes gray-scale images of hand-written digits and English letters, producing a 62-way classification task. (3) Shakespeare (Caldas et al. 2018) is a next-character prediction task and comprises sentences from Shakespeare's dialogues. ... (1) PACS (Li et al. 2017)... (2) Office-Home (Venkateswara et al. 2017)... (3) DomainNet (Peng et al. 2019)... full-sized ImageNet-1K (Deng et al. 2009)
Dataset Splits Yes Each client k owns a training, validation, and testing set, denoted by T^k, V^k, and E^k, respectively. To simulate the communication capacity of a real-world federated system, we presume that there are exactly K ∈ N+ active clients joining each communication round. ... We investigate 2 different partitions of the datasets: (1) For i.i.d (IID) setting, we randomly shuffle the dataset and evenly distribute the data to each client. (2) For non-i.i.d (NIID) settings, we follow (Khodak et al. 2021; Caldas et al. 2018) and assume each client contains data from a specific writer in FEMNIST, or it represents an actor in Shakespeare. For the CIFAR-10 dataset, we follow prior arts (Zhu, Hong, and Zhou 2021; Lin et al. 2020) to model Non-IID label distributions using the Dirichlet distribution Dir(x), in which a smaller x indicates higher data heterogeneity.
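The Dirichlet-based Non-IID split quoted above is a standard technique and can be sketched as follows. This is a generic illustration of Dir(x) label partitioning, not the authors' actual code; the function name and signature are assumptions for the example.

```python
import numpy as np

def dirichlet_partition(labels, n_clients, alpha, seed=0):
    """Partition sample indices across clients with a Dirichlet label prior.

    A smaller `alpha` yields more heterogeneous (Non-IID) per-client label
    distributions, matching the role of x in Dir(x) described above.
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    n_classes = int(labels.max()) + 1
    client_indices = [[] for _ in range(n_clients)]
    for c in range(n_classes):
        idx = rng.permutation(np.where(labels == c)[0])
        # Draw per-client proportions for this class from Dir(alpha).
        props = rng.dirichlet(np.full(n_clients, alpha))
        cuts = (np.cumsum(props)[:-1] * len(idx)).astype(int)
        for k, part in enumerate(np.split(idx, cuts)):
            client_indices[k].extend(part.tolist())
    return client_indices
```

With a large `alpha` the split approaches the IID case; with `alpha` near zero each client ends up dominated by a few classes.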
Hardware Specification No The paper does not provide specific details about the hardware used for the experiments (e.g., GPU models, CPU types, memory specifications).
Software Dependencies No The paper does not provide specific version numbers for any software dependencies, libraries, or frameworks used in the experiments.
Experiment Setup Yes We set the communication budget (Rt, Rc) to (4000, 800) for CIFAR-10 and Shakespeare, while (2000, 200) for FEMNIST following (Khodak et al. 2021; Caldas et al. 2018). Besides, we adopt 500 clients for CIFAR-10, 3550 clients for FEMNIST, and 1129 clients for Shakespeare. For the coefficients used in FedPop, we set the initial perturbation intensity ϵ_0 to 0.1, the initial resampling probability p_re^0 to 0.1, and the quantile coefficient ρ to 3. The perturbation interval Tg for FedPop-G is set to 0.1Rc. Following (Khodak et al. 2021), we define α ∈ R^3 and β ∈ R^7, i.e., we tune learning rate, scheduler, and momentum for server-side aggregation (Agg), and learning rate, scheduler, momentum, weight-decay, the number of local epochs, batch-size, and dropout rate for local clients updates (Loc), respectively.
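The perturbation and resampling coefficients quoted above (ϵ_0 = 0.1, p_re^0 = 0.1) suggest a population-based tuning step of the usual PBT flavor. The following is a minimal sketch of such a step under that assumption; it is not the paper's Algorithm 1, and the function name, search-space format, and clipping behavior are illustrative choices.

```python
import random

def perturb_hyperparams(hp, search_space, eps=0.1, p_re=0.1, rng=random):
    """One population-based perturbation step (hypothetical sketch).

    For each hyperparameter: with probability `p_re` resample it uniformly
    from its range; otherwise jitter it multiplicatively within +/- `eps`,
    clipped back into the valid range.
    """
    new_hp = {}
    for name, value in hp.items():
        lo, hi = search_space[name]
        if rng.random() < p_re:
            # Resample uniformly from the search range.
            new_hp[name] = rng.uniform(lo, hi)
        else:
            # Multiplicative jitter with intensity eps, then clip.
            factor = 1.0 + rng.uniform(-eps, eps)
            new_hp[name] = min(max(value * factor, lo), hi)
    return new_hp
```

In a population-based scheme, such a step would be applied to the hyperparameters of low-performing population members (e.g. the bottom quantile selected via ρ) after copying from a better-performing member.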