Heterogeneous Federated Learning with Scalable Server Mixture-of-Experts

Authors: Jingang Jiang, Yanzhao Chen, Xiangyang Liu, Haiqi Jiang, Chenyou Fan

IJCAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | 5 Experiments: We verify our approach on the benchmark Federated Extended MNIST (FEMNIST) [Caldas et al., 2018] and CIFAR-10 [Krizhevsky, 2009] for the image classification task, SENT140 [Caldas et al., 2018] for the textual sentiment classification task, and YELP [Zhang et al., 2015] for the 5-way review star classification task. 5.1 Fed-MoE and Baselines for Comparison... 5.3 Results with Various Datasets and Settings... 5.4 Ablation Studies...
Researcher Affiliation | Academia | Jingang Jiang, Yanzhao Chen, Xiangyang Liu, Haiqi Jiang and Chenyou Fan, South China Normal University, Guangzhou, China
Pseudocode | Yes |
Algorithm 1: Fed-MoE overview.
while round e ≤ E do
    Stage-A: Local client training and uploading
        /* Upload m participating client models to server. */
        M ← {M_1, M_2, ..., M_{m−1}, M_m}
    Stage-B: Server MoE iterative update
        /* Step-0: Probe client experts' responses on D_r. */
        P ← M(X), (X, y) ∈ D_r              // client responses
        P_y ← P[:, y] ∈ ℝ^{m×1}             // label confidence
        while t ≤ T do
            /* Step-1: Get server gating responses. */
            Q ← G(X) ∈ ℝ^{K×1}              // gating of Eq. (3)
            /* Step-2: Get server-client correlation. */
            W^r ← σ_row(Q P_y^T) ∈ ℝ^{K×m}
            /* Step-3: Update server experts by moving FedAvg. */
            E_0^{t+1} ← (1−λ) E_0^t + λ M̄              // M̄: average of client models
            E_i^{t+1} ← (1−λ) E_i^t + λ (W^r M)_i,  i ≥ 1
            /* Step-4: Update server gating module. */
            G^{t+1} ← G^t − η ∇_G L_gate,  α^{t+1} ← α^t − η ∇_α L_ce   // loss of Eq. (9)
    Stage-C: Synchronize model E back to clients.
        /* Get updated server gating. */
        Q ← G(X)                            // G updated in Stage-B
        /* Get updated server-client correlation. */
        W^c ← σ_col(cat(1−α, α (Q P_y^T))) ∈ ℝ^{(K+1)×m}
        /* Update client experts by moving FedAvg. */
        M ← λ M + (1−λ) (W^c)^T E           // Eq. (13)
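The core of Stage-B (Step-2's row-softmax server-client correlation and Step-3's moving-FedAvg expert update) can be sketched in NumPy. All names, the flattened-parameter shapes, and the λ value here are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def softmax(z, axis):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def server_expert_update(Q, P_y, E, M_bar, M, lam=0.5):
    """One iteration of Stage-B Steps 2-3 (sketch).

    Q    : (K, 1)   server gating responses on the reserve set
    P_y  : (m, 1)   label confidences of the m client models
    E    : (K+1, d) server experts flattened as vectors; E[0] is the shared expert
    M_bar: (d,)     average of the m client models (feeds the shared expert)
    M    : (m, d)   client model parameters, flattened
    """
    # Step-2: row-wise softmax over the K x m correlation matrix
    W_r = softmax(Q @ P_y.T, axis=1)                 # (K, m), rows sum to 1
    # Step-3: moving-FedAvg update of server experts
    E_new = E.copy()
    E_new[0] = (1 - lam) * E[0] + lam * M_bar        # shared expert <- client average
    E_new[1:] = (1 - lam) * E[1:] + lam * (W_r @ M)  # routed experts <- weighted clients
    return W_r, E_new
```

Each row of `W_r` distributes one routed expert's update budget across the m client models, so every server expert drifts toward the clients it correlates with most.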
Open Source Code | No | The paper neither contains an explicit statement about releasing open-source code nor provides a link to a code repository for the described methodology.
Open Datasets | Yes | We verify our approach on the benchmark Federated Extended MNIST (FEMNIST) [Caldas et al., 2018] and CIFAR-10 [Krizhevsky, 2009] for the image classification task, SENT140 [Caldas et al., 2018] for the textual sentiment classification task, and YELP [Zhang et al., 2015] for the 5-way review star classification task.
Dataset Splits | Yes | Vision data split: We follow the original Non-IID split of FEMNIST according to the different writing styles of 3,500 users. We choose 50 and 100 clients as two FL scenarios for FEMNIST, each client having 6,200 and 5,650 samples, respectively. On CIFAR-10, we simulate highly Non-IID scenarios by distributing data classes using a Dirichlet distribution (α = 1.0) to ensure that each client gets a unique, proportionately varied subset of classes. For the 50- and 100-client cases, each client has 750 and 375 samples, respectively. On the sentiment analysis benchmark SENT-140 [Caldas et al., 2018], we follow [Fan et al., 2022] and evaluate it as a binary classification task. We reserve 100 clients, each having 190 sentences per class. The server reserves |Dr| = 1000 sentences. We tokenize each sentence to a maximum of 64 words. The YELP 5-way classification task... We configured 100 clients, where each client gathers 5,000 data samples. The data is partitioned in a Non-IID fashion, ensuring that one class predominates on each client. The server reserves |Dr| = 1000 samples.
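The Dirichlet-based Non-IID split described for CIFAR-10 (α = 1.0) is commonly simulated as below; this is a sketch of the standard recipe, not the paper's released script, and the function name is illustrative:

```python
import numpy as np

def dirichlet_partition(labels, n_clients, alpha=1.0, seed=0):
    """Partition sample indices into Non-IID client shards.

    For each class, draw client proportions from Dirichlet(alpha)
    and split that class's indices accordingly. Smaller alpha yields
    more skewed (more strongly Non-IID) partitions.
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    shards = [[] for _ in range(n_clients)]
    for c in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == c))
        props = rng.dirichlet(alpha * np.ones(n_clients))
        # Cumulative proportions -> cut points within this class's indices
        cuts = (np.cumsum(props)[:-1] * len(idx)).astype(int)
        for shard, part in zip(shards, np.split(idx, cuts)):
            shard.extend(part.tolist())
    return shards
```

Every sample lands in exactly one shard, so the shards form a true partition of the dataset; per-client class mixes vary with each Dirichlet draw.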
Hardware Specification | Yes | We perform all experiments on a system with 3 NVIDIA 4090 24G graphics cards, with M = 50 and 100 clients to build a large-scale FL system.
Software Dependencies | No | The paper mentions models like BERT and GPT-2, but does not provide specific version numbers for any software dependencies or libraries used for implementation (e.g., Python, PyTorch/TensorFlow, CUDA).
Experiment Setup | Yes | For vision tasks, we set K = 5 and m = 5, applying a 2-layer CNN for FEMNIST and ResNet-18 for CIFAR-10. For language modeling tasks, we set K = 3 and m = 3, using BERT-base for SENT-140 and GPT-2 for YELP, which have billions of parameters. Details are in Table 1. The learned routed-experts importance α in Eq. (1) for FEMNIST, CIFAR, SENT140 and YELP is 0.56, 0.38, 0.48 and 0.50, respectively. We set β = 10^-3 and discuss it in the ablation of Table 5.
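The reported α values weight the routed experts against the shared expert. Assuming Eq. (1) takes the common form y = (1−α)·E_0(x) + α·Σ_k g_k(x)·E_k(x), a minimal forward-pass sketch follows; the function name, linear-expert form, and softmax gating are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def moe_forward(x_feat, shared_w, expert_ws, gate_w, alpha):
    """Sketch of a server MoE forward pass with importance alpha.

    x_feat    : (d,)          input features
    shared_w  : (d_out, d)    shared expert (linear, for illustration)
    expert_ws : list of K (d_out, d) routed experts
    gate_w    : (K, d)        gating weights
    alpha     : scalar in [0, 1], learned routed-experts importance
    """
    # Softmax gating over the K routed experts (numerically stable)
    z = gate_w @ x_feat
    z = z - z.max()
    gate = np.exp(z)
    gate = gate / gate.sum()                                  # (K,)
    routed = sum(g * (w @ x_feat) for g, w in zip(gate, expert_ws))
    shared = shared_w @ x_feat
    # Eq. (1)-style mixture: alpha trades shared vs. routed outputs
    return (1 - alpha) * shared + alpha * routed
```

Under this reading, FEMNIST's α = 0.56 means its output leans slightly toward the routed experts, while CIFAR's α = 0.38 leans toward the shared expert.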