Differentiated Aggregation to Improve Generalization in Federated Learning

Authors: Peyman Gholami, Hulya Seferoglu

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results confirm that FedALS outperforms the baselines in terms of accuracy in non-IID setups while also saving on communication costs across all setups. Our codes are available for reproducibility. In this section, we assess the performance of FedALS using the deep neural network model ResNet-20 for the CIFAR-10, CIFAR-100 (Krizhevsky, 2009), SVHN (Netzer et al., 2011), and MNIST (LeCun et al., 1998) datasets.
Researcher Affiliation | Academia | Peyman Gholami, Department of Electrical and Computer Engineering, University of Illinois Chicago; Hulya Seferoglu, Department of Electrical and Computer Engineering, University of Illinois Chicago
Pseudocode | Yes | Algorithm 1: FedALS; Algorithm 2: FedAvg; Algorithm 3: SCAFFOLD; Algorithm 4: FedALS + SCAFFOLD
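For context on the baseline listed as Algorithm 2, a single FedAvg communication round amounts to broadcasting the global model, running local SGD on each client, and averaging the results. The sketch below is illustrative only and not the paper's implementation; the toy `local_sgd` loss, client count, and tensor shapes are assumptions.

```python
import numpy as np

def local_sgd(w, data, lr=0.1, steps=5):
    # Placeholder local update: each client takes `steps` gradient
    # steps on its own data, here with a toy quadratic loss
    # ||w - mean(data)||^2 standing in for real training.
    for _ in range(steps):
        grad = 2 * (w - data.mean(axis=0))
        w = w - lr * grad
    return w

def fedavg_round(global_w, client_data):
    # One FedAvg round: every client starts from the current global
    # model, runs local SGD, and the server averages the resulting
    # models with equal weights.
    client_models = [local_sgd(global_w.copy(), d) for d in client_data]
    return np.mean(client_models, axis=0)
```

FedALS (Algorithm 1) modifies how and how often different parts of the model are aggregated; the uniform full-model average above is only the baseline it is compared against.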
Open Source Code | Yes | Our codes are available for reproducibility.
Open Datasets | Yes | We evaluate the performance of FedALS using the deep neural network model ResNet-20 for the CIFAR-10, CIFAR-100 (Krizhevsky, 2009), SVHN (Netzer et al., 2011), and MNIST (LeCun et al., 1998) datasets. We also estimate the impact of FedALS on large language models (LLMs) by fine-tuning OPT-125M (Zhang et al., 2022) on the Multi-Genre Natural Language Inference (MultiNLI) corpus (Williams et al., 2018).
Dataset Splits | Yes | For image classification, we initially sorted the data based on their labels and subsequently divided it among nodes following this sorted sequence. In MultiNLI, we sorted the sentences based on their genre. ... Table 4: Data distribution: IID (shuffled and split), non-IID (sorted by label, then split). ... Table 5: Data distribution: IID (shuffled and split), non-IID (sorted by genre, then split).
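The label-sorted non-IID partitioning described in that row can be sketched as follows. This is a minimal reconstruction, assuming contiguous, roughly even shards; the function name and how remainders are handled are assumptions, not the paper's code.

```python
import numpy as np

def non_iid_split(labels, num_nodes):
    # Sort sample indices by label, then hand out contiguous shards,
    # so each node sees only a few (mostly adjacent) classes.
    order = np.argsort(labels, kind="stable")
    return np.array_split(order, num_nodes)
```

An IID split, by contrast, would shuffle the indices before splitting, so every node receives a mix of all classes.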
Hardware Specification | No | The experimentation was conducted on a network consisting of five nodes alongside a central server. No specific hardware details (GPU/CPU models, memory) are provided.
Software Dependencies | No | SGD with momentum was employed as the optimizer, with the momentum set to 0.9 and the weight decay to 10^-4. For the LLM fine-tuning, we employed a batch size of 16 sentences from the corpus, and the optimizer used was AdamW. Specific version numbers for software libraries or environments are not provided.
Experiment Setup | Yes | For image classification, we utilized a batch size of 64 per node. SGD with momentum was employed as the optimizer, with the momentum set to 0.9 and the weight decay to 10^-4. For the LLM fine-tuning, we employed a batch size of 16 sentences from the corpus, and the optimizer used was AdamW. In all the experiments, to perform a grid search for the learning rate, we conducted each experiment by multiplying and dividing the learning rate by powers of two, stopping each experiment after reaching a local optimum learning rate. We repeat each experiment 20 times and present the error bars associated with the randomness of the optimization. In every figure, we include the average and standard deviation error bars. A detailed experimental setup is provided in Appendix E of the supplementary materials. ... Table 4: local steps τ = 5, adaptation coefficient α = 10, batch size 64 per client, momentum 0.9, weight decay 10^-4, number of iterations 10^4 for IID and 2×10^4 for non-IID, 20 repetitions. ... Table 5: local steps τ = 5, adaptation coefficient α = 10, batch size 16 sentences per client, Adam β1 = 0.9, β2 = 0.999, ε = 10^-8, number of iterations 10^4 for IID and 2×10^4 for non-IID, 20 repetitions.
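The learning-rate search quoted above (multiply and divide by powers of two, stop at a local optimum) can be read as a simple hill climb over a power-of-two grid. The sketch below is one plausible reading of that procedure, not the authors' script; `evaluate` is a hypothetical stand-in for a full training run returning validation performance.

```python
def grid_search_lr(evaluate, lr0=0.1):
    # Hill-climb over learning rates on a power-of-two grid:
    # first keep doubling the rate while validation performance
    # improves, then keep halving it, stopping each direction
    # at the first local optimum.
    best_lr, best_score = lr0, evaluate(lr0)
    for factor in (2.0, 0.5):
        lr = best_lr * factor
        score = evaluate(lr)
        while score > best_score:
            best_lr, best_score = lr, score
            lr *= factor
            score = evaluate(lr)
    return best_lr
```

Note this finds a local optimum on the grid, which matches the stopping rule described; it does not guarantee a global best learning rate.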