Federated Learning with Uncertainty and Personalization via Efficient Second-order Optimization
Authors: Shivam Pal, Aishwarya Gupta, Saqib Sarwar, Piyush Rai
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We experiment on three publicly available datasets: EMNIST (Cohen et al., 2017), SVHN (Netzer et al., 2011), and CIFAR-10 (Krizhevsky, 2009). We evaluate FedIvon in a challenging and realistic scenario involving heterogeneous data distribution among a large number of clients, with each client having very few training examples. We compare our proposed method FedIvon with FedAvg (McMahan et al., 2017) (simple aggregation of client models at the server) and FedLaplace (Liu et al., 2024) (using the Laplace approximation to fit a Gaussian distribution to each client's local model, followed by aggregation at the server). Table 1: Test accuracy (ACC), Expected Calibration Error (ECE), Negative Log-Likelihood (NLL), and Brier Score (BS). Figure 2: Loss and accuracy of various methods vs. rounds for the EMNIST, SVHN, and CIFAR-10 datasets. |
| Researcher Affiliation | Academia | Shivam Pal (EMAIL), Dept. of Computer Science and Engineering, IIT Kanpur; Aishwarya Gupta (EMAIL), Dept. of Computer Science and Engineering, IIT Kanpur; Saqib Sarwar (EMAIL), Dept. of Computer Science and Engineering, IIT Kanpur; Piyush Rai (EMAIL), Dept. of Computer Science and Engineering and Dept. of Intelligent Systems, IIT Kanpur |
| Pseudocode | Yes | Algorithm 1: Client_Update; Algorithm 2: FedIvon Algorithm |
| Open Source Code | No | The paper does not explicitly provide a link to open-source code or state that code for the described methodology is released or available in supplementary materials. |
| Open Datasets | Yes | We experiment on three publicly available datasets: EMNIST (Cohen et al., 2017), SVHN (Netzer et al., 2011), and CIFAR-10 (Krizhevsky, 2009). For personalized FL experiments, we focus on two types of data heterogeneity across clients, similar to Zhu et al. (2023), for the classification task. Class distribution skew: clients have data from only a limited set of classes; to simulate this, we use the CIFAR-10 dataset and assign each client data from a random selection of 5 out of the 10 classes. Class concept drift: to simulate class concept drift, we use the CIFAR-100 dataset, which includes 20 superclasses, each containing 5 subclasses. |
| Dataset Splits | Yes | EMNIST consists of 28x28 grayscale images of letters and digits (0-9) with a train and test split comprising 124800 and 20800 images, respectively; however, in our experiments, we restrict to letters only. SVHN consists of 32x32 RGB images of house number plates categorized into 10 distinct classes, each corresponding to one of the ten digits; it has a train and test split of size 73257 and 26032, respectively. CIFAR-10 comprises 32x32 RGB images of objects classified into 10 classes, with 50000 training images and 10000 test images. To simulate a non-i.i.d. data distribution, we randomly sample inputs from the training split, partition the sampled inputs into shards, and distribute shards among clients to create class-imbalanced training data, similar to Chen & Chao (2020). |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running its experiments. It only mentions FLOPs and Runtime as metrics. |
| Software Dependencies | No | In our experiments, we use the Adam optimizer with learning_rate=1e-3, weight_decay=2e-4 for the FedAvg and FedLaplace methods. The IVON (Shen et al., 2024) optimizer is used for FedIvon, with hyperparameters given in Table 6. A linearly decaying learning rate is used in all experiments. FLOPs/round is the average number of floating-point operations per round (estimated via PyTorch's profiler). |
| Experiment Setup | Yes | In our experiments, we use the Adam optimizer with learning_rate=1e-3, weight_decay=2e-4 for the FedAvg and FedLaplace methods. The IVON (Shen et al., 2024) optimizer is used for FedIvon, with hyperparameters given in Table 6. A linearly decaying learning rate is used in all experiments. For all the baselines and FedIvon, we run the federated algorithm for 2000 communication rounds, selecting a randomly sampled 5% of clients (i.e., 10 clients) per round. We train each client's model locally for 2 epochs using a batch size of 32. Table 6: IVON hyperparameters for FL experiments (initial learning rate 0.1; final learning rate 0.01; weight decay 2e-4; batch size 32; ESS (λ) 5000; initial Hessian (h0) 2.0, 5.0, 1.0; MC samples while training 1; MC samples at test 500). Table 7: IVON hyperparameters for personalized FL experiments (initial learning rate 0.1; final learning rate 0.001; weight decay 1e-3; batch size 32; expected sample size (λ) 10000; initial Hessian (h0) 1.0; MC samples while training 1; MC samples at test 64). |
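The evaluation metrics quoted in the Research Type row (ACC, ECE, NLL, Brier score) have standard definitions. A minimal NumPy sketch is below; the function names and the 15-bin equal-width binning for ECE are our own illustrative choices, not taken from the paper:

```python
import numpy as np

def brier_score(probs, labels):
    """Mean squared error between predicted probabilities and one-hot labels."""
    onehot = np.eye(probs.shape[1])[labels]
    return np.mean(np.sum((probs - onehot) ** 2, axis=1))

def nll(probs, labels, eps=1e-12):
    """Average negative log-likelihood of the true class."""
    return -np.mean(np.log(probs[np.arange(len(labels)), labels] + eps))

def ece(probs, labels, n_bins=15):
    """Expected calibration error: |accuracy - confidence| per confidence bin,
    weighted by the fraction of examples falling in that bin."""
    conf = probs.max(axis=1)
    pred = probs.argmax(axis=1)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    err = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            err += mask.mean() * abs((pred[mask] == labels[mask]).mean()
                                     - conf[mask].mean())
    return err
```

A perfectly confident and correct classifier scores zero on all three metrics, which is a quick sanity check when wiring these into an evaluation loop.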
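The shard-based non-i.i.d. partitioning described in the Dataset Splits row (sample, split into shards, deal shards to clients) can be sketched as follows. This is a minimal illustration in the style of McMahan et al. (2017): we assume examples are sorted by label before sharding so that each client ends up class-imbalanced; `shard_partition` is our own name, not the paper's:

```python
import numpy as np

def shard_partition(labels, num_clients, shards_per_client, seed=0):
    """Sort examples by label, split into equal shards, and deal shards
    to clients at random, yielding class-imbalanced local datasets."""
    rng = np.random.default_rng(seed)
    order = np.argsort(labels)                  # group same-class examples
    num_shards = num_clients * shards_per_client
    shards = np.array_split(order, num_shards)  # contiguous label blocks
    shard_ids = rng.permutation(num_shards)     # deal shards at random
    return [
        np.concatenate([shards[s] for s in
                        shard_ids[c * shards_per_client:(c + 1) * shards_per_client]])
        for c in range(num_clients)
    ]

# Toy example: 1000 examples over 10 classes, 200 clients, 2 shards each,
# so each client sees only a few classes.
labels = np.repeat(np.arange(10), 100)
parts = shard_partition(labels, num_clients=200, shards_per_client=2)
```

With few shards per client, each local dataset covers only a small subset of the classes, which is what makes the federated setting heterogeneous.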
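The round structure from the Experiment Setup row (sample 5% of clients, train locally for 2 epochs, aggregate at the server) can be sketched as a FedAvg-style loop. This is an illustrative skeleton only, not the authors' code: `server_round` and the client's `local_update` method are hypothetical stand-ins for the paper's Algorithm 2 and Client_Update, and the aggregation shown is a plain example-weighted average:

```python
import random

def server_round(global_params, clients, frac=0.05, local_epochs=2):
    """One communication round: sample clients, train locally, average."""
    k = max(1, int(frac * len(clients)))
    selected = random.sample(clients, k)
    updates, sizes = [], []
    for client in selected:
        params = dict(global_params)  # each client starts from the global model
        params = client.local_update(params, epochs=local_epochs)
        updates.append(params)
        sizes.append(client.num_examples)
    total = sum(sizes)
    # Example-weighted average of client parameters (FedAvg-style aggregation)
    return {
        name: sum(n * u[name] for n, u in zip(sizes, updates)) / total
        for name in global_params
    }
```

Running this for 2000 rounds with `frac=0.05` over 200 clients reproduces the paper's schedule of 10 participating clients per round; FedIvon would additionally aggregate the per-client Hessian estimates maintained by IVON.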