On the Unreasonable Effectiveness of Federated Averaging with Heterogeneous Data

Authors: Jianyu Wang, Rudrajit Das, Gauri Joshi, Satyen Kale, Zheng Xu, Tong Zhang

TMLR 2024

Reproducibility Variable Result LLM Response
Research Type Experimental By performing experiments on naturally heterogeneous federated datasets, we show that previous theoretical predictions do not align well with practice. FedAvg can have nearly identical performance on both IID and non-IID versions of these datasets. Thus, previous worst-case analyses may be too pessimistic for such datasets. [...] We conduct some experiments on Stack Overflow, a naturally non-IID split dataset for next-word prediction. [...] In Figure 3, we first run mini-batch SGD on Federated EMNIST (FEMNIST) (McMahan et al., 2017) and Stack Overflow Next Word Prediction datasets (Reddi et al., 2019) to obtain an approximation for the optimal model w*. Then we evaluate the average drift at optimum ρ = E_c B_c(w*) and its upper bound as given in (7) on these datasets.
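The drift-at-optimum evaluation quoted above can be sketched as follows. This is an illustrative assumption, not the authors' code: here the per-client drift B_c(w*) is taken to be the displacement of a few local GD steps started from the approximate optimum w*, and ρ is its average norm over clients; the function names and the quadratic test objectives are hypothetical.

```python
import numpy as np

def local_update(w, grad_fn, lr=0.1, steps=5):
    """Run a few local gradient-descent steps from w; return the final point."""
    for _ in range(steps):
        w = w - lr * grad_fn(w)
    return w

def average_drift_at_optimum(w_star, client_grad_fns, lr=0.1, steps=5):
    """Estimate rho = E_c ||local_update_c(w*) - w*|| averaged over clients.

    w_star: approximate global optimum (e.g. from mini-batch SGD).
    client_grad_fns: one gradient function per client, evaluated on that
    client's local data.
    """
    drifts = [np.linalg.norm(local_update(w_star, g, lr, steps) - w_star)
              for g in client_grad_fns]
    return float(np.mean(drifts))
```

On homogeneous (IID) clients every local gradient vanishes at w*, so this estimate is zero; heterogeneity shows up directly as a positive average drift.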
Researcher Affiliation Collaboration Jianyu Wang EMAIL Carnegie Mellon University; Rudrajit Das EMAIL University of Texas at Austin; Gauri Joshi EMAIL Carnegie Mellon University; Satyen Kale EMAIL Google Research; Zheng Xu EMAIL Google Research; Tong Zhang EMAIL University of Illinois Urbana-Champaign
Pseudocode No The paper describes the Federated Averaging algorithm in detail in Section 2, including its update rule (Equation 2), but does not present it in a structured pseudocode or algorithm block.
Open Source Code No The paper does not contain any explicit statements about releasing source code, nor does it provide a link to a code repository.
Open Datasets Yes In Figure 3, we first run mini-batch SGD on Federated EMNIST (FEMNIST) (McMahan et al., 2017) and Stack Overflow Next Word Prediction datasets (Reddi et al., 2019) to obtain an approximation for the optimal model w*. Then we evaluate the average drift at optimum ρ = E_c B_c(w*) and its upper bound as given in (7) on these datasets. [...] We run the same set of experiments on a non-IID CIFAR-100 dataset.
Dataset Splits Yes From the naturally heterogeneous Stack Overflow dataset, we create its IID version by aggregating and shuffling the data from all clients, and then re-assigning the IID data back to clients. [...] For example, each client may only hold one or very few classes of data (Zhao et al., 2018), or has data for all classes but the amount of each class is randomly drawn from a Dirichlet distribution (Hsu et al., 2019).
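The IID re-splitting procedure quoted in this row (pool all clients' data, shuffle, deal it back out) can be sketched as follows; the function name and dict-of-lists data layout are illustrative assumptions.

```python
import random

def make_iid_version(client_data, seed=0):
    """Pool every client's examples, shuffle, and deal them back so each
    client keeps its original number of examples but receives an IID sample
    of the global distribution."""
    rng = random.Random(seed)
    pooled = [ex for data in client_data.values() for ex in data]
    rng.shuffle(pooled)
    iid, start = {}, 0
    for cid, data in client_data.items():
        iid[cid] = pooled[start:start + len(data)]
        start += len(data)
    return iid
```

Keeping per-client sizes fixed isolates distributional heterogeneity: the IID and non-IID versions differ only in which examples each client holds, not in how many.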
Hardware Specification No The paper provides details on the models (ConvNet, LSTM), loss functions, number of clients, local optimizer, and local learning rates in Table 3, but does not specify any hardware components like GPU or CPU models used for running the experiments.
Software Dependencies No The paper states 'we strictly follow the training setup given in Reddi et al. (2020)' for experiments on FEMNIST, Stack Overflow, and CIFAR-100 datasets, but it does not explicitly list any specific software dependencies or their version numbers.
Experiment Setup Yes In Table 3, the paper provides specific experimental details for FEMNIST, Stack Overflow, and CIFAR-100 datasets, including the model type (ConvNet, LSTM), loss function (Cross-Entropy), number of clients (500, 1000, 200), local optimizer (GD), and local learning rate (0.1, 0.5).
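Although the paper gives no pseudocode block (see the Pseudocode row), the FedAvg update it describes in Section 2 can be sketched as one server round below. This is a minimal illustration under stated assumptions: full model vectors are exchanged, sampled clients run local GD (matching the "GD" local optimizer in Table 3), and the server takes an unweighted average; the function names are hypothetical.

```python
import numpy as np

def fedavg_round(w, client_grad_fns, sampled, local_lr=0.1, local_steps=5):
    """One FedAvg round: each sampled client runs local GD starting from
    the global model w, then the server averages the resulting local models."""
    local_models = []
    for c in sampled:
        w_c = w.copy()
        for _ in range(local_steps):
            w_c = w_c - local_lr * client_grad_fns[c](w_c)
        local_models.append(w_c)
    return np.mean(local_models, axis=0)
```

With quadratic client objectives and full participation, iterating this round contracts toward the average of the client optima, which is a convenient sanity check for an implementation.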