Federated Domain Generalization with Decision Insight Matrix

Authors: Tianchi Liao, Binghui Xie, Lele Fu, Sheng Huang, Bowen Deng, Chuan Chen, Zibin Zheng

IJCAI 2025

Reproducibility Variable Result LLM Response
Research Type Experimental To evaluate our approach, we conducted experiments on four datasets. Rotated MNIST [Ghifary et al., 2015] is built from MNIST by rotating 7,000 samples at angles of 0°, 15°, 30°, 45°, 60°, and 75°, resulting in six different domains. PACS [Li et al., 2017] has 9,991 images of seven object categories in four domains (photo, art, cartoon, and sketch). VLCS [Fang et al., 2013] has 10,729 images of five object categories in four domains (Caltech101, LabelMe, SUN09, and VOC2007). Office-Home [Venkateswara et al., 2017] is an image recognition dataset that includes 15,588 images of 65 classes from four different domains (art, clipart, product, and real-world). These datasets are commonly used in the domain generalization literature. We adhere to the experimental methodology outlined in FedIIR [Guo et al., 2023]. For all datasets, we apply the leave-one-domain-out strategy [Gulrajani and Lopez-Paz, 2020]: one domain is chosen as the test domain, the model is trained on all remaining domains, and then evaluated on the held-out domain. Each source domain is treated as a client. Following standard practice, we use 90% of the available data for training and 10% for validation. Considering the FL setting, we explore two scenarios based on the number of clients: the one-domain-one-client scenario and the one-domain-multiple-clients scenario. In the one-domain-one-client scenario, each training domain is treated as an individual client. In the one-domain-multiple-clients scenario, data from each training domain is randomly partitioned into multiple subsets, with each client holding one subset of a given training domain. Details of the data partitioning are provided in Appendix C.1. Baselines.
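The one-domain-multiple-clients partitioning described above can be sketched as follows; `partition_domain` is a hypothetical helper (the paper's exact partitioning is in its Appendix C.1), shown only to illustrate splitting one domain's samples randomly across several clients.

```python
import random

def partition_domain(indices, num_clients, seed=0):
    """Randomly split one training domain's sample indices across clients
    (one-domain-multiple-clients scenario). Hypothetical illustration,
    not the paper's exact partitioning scheme."""
    rng = random.Random(seed)
    shuffled = list(indices)
    rng.shuffle(shuffled)
    # Deal shuffled indices round-robin so client shard sizes differ by at most 1.
    return [shuffled[i::num_clients] for i in range(num_clients)]

# Example: 100 samples of one domain split across 3 clients.
shards = partition_domain(range(100), 3)
```

Every sample lands on exactly one client, and shard sizes stay balanced.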
We consider two classic federated methods, FedAvg [McMahan et al., 2017] and FedProx [Li et al., 2020b], and four state-of-the-art federated methods for domain generalization, FedADG [Zhang et al., 2021], FedSR [Nguyen et al., 2022], FedIIR [Guo et al., 2023], and FedLGF [Yan and Guo, 2025], as baselines. Implementation. We design dataset-specific models for each task. For the Rotated MNIST dataset, the feature encoder consists of four convolutional blocks with ReLU activation, group normalization, and average pooling, followed by a linear classifier; the training batch size is 64. For the VLCS and PACS datasets, ResNet-18 is used as the feature encoder, while ResNet-50 is employed for the Office-Home dataset. The classifiers for these three datasets consist of two fully connected layers; the training batch size is 32. For all datasets and scenarios, we set the number of communication rounds T to 100, with local iterations per round E = 1 to accommodate limited local computational resources. Local models are updated using the SGD optimizer with a momentum of 0.9. For the baselines, the best parameters reported in the original papers were used, and the optimal hyperparameters of FedDIM were found by grid search. Each experiment was repeated 3 times and the average was reported. Our proposed method consistently outperforms other state-of-the-art baselines: in terms of average accuracy, it outperforms the latest baseline FedLGF by 1.47% across all datasets. These observations validate the effectiveness of our method compared to existing baselines. As the total number of clients increases, the performance of all methods declines significantly in the one-domain-multi-client scenario.
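The baselines above build on FedAvg-style server aggregation, which can be sketched as a size-weighted average of client parameters; this is a minimal illustration of that standard step, not the paper's FedDIM aggregation, with parameters represented as plain lists of floats for clarity.

```python
def fedavg_aggregate(client_weights, client_sizes):
    """Size-weighted average of client model parameters (plain FedAvg step).

    client_weights: list of dicts mapping parameter name -> list of floats.
    client_sizes:   list of per-client sample counts used as weights.
    """
    total = sum(client_sizes)
    return {
        name: [
            sum(w[name][i] * n for w, n in zip(client_weights, client_sizes)) / total
            for i in range(len(client_weights[0][name]))
        ]
        for name in client_weights[0]
    }

# Example: client 2 holds 3x the data of client 1, so its weights dominate.
agg = fedavg_aggregate([{"w": [1.0, 2.0]}, {"w": [3.0, 4.0]}], [1, 3])
```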
Researcher Affiliation Academia 1Sun Yat-sen University, Guangzhou, China; 2The Chinese University of Hong Kong, Hong Kong. EMAIL, EMAIL, EMAIL
Pseudocode Yes Algorithm 1 FedDIM
Input: total rounds T, local epochs E, total number of clients M, sampled number of clients C, learning rate η, loss hyperparameter λ
Server executes:
1: Initialize global model θ and global insight matrix I
2: for each round t = 1, ..., T do
3:   Server samples subset C of clients
4:   for each client c ∈ C in parallel do
5:     {θ_c^t, I_c} ← ClientUpdate(θ^t, I^t)
6:   end for
7:   Update global model and compute the aggregated insight matrix by Eq. (5)
8:   Update global insight matrix I^{t+1} by Eq. (6)
9: end for
ClientUpdate:
1: Initialize local model θ_c^t = θ^t
2: for each local epoch e = 1, ..., E do
3:   Sample mini-batch B_c
4:   Compute the n-th sample insight matrix I_n
5:   Compute local loss by Eq. (7)
6:   Update local model: θ_c^t ← θ_c^t − η ∇L_c(θ_c^t; B_c)
7: end for
8: Compute the mean class insight matrix I_c by Eq. (4)
9: return θ_c^t and I_c
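One server round of Algorithm 1 can be sketched in Python as below. Since Eqs. (4)-(7) are not reproduced in this report, the aggregation is an assumed instantiation: Eq. (5) as a uniform average of the returned local models and insight matrices, and Eq. (6) as a momentum update using the paper's momentum coefficient m; models and insight matrices are flattened to lists of floats for illustration.

```python
def feddim_round(global_model, global_insight, client_updates, m):
    """One server round of Algorithm 1 (FedDIM), sketched under assumptions.

    client_updates: list of dicts {"model": [...], "insight": [...]} returned
    by the sampled clients' ClientUpdate calls.
    """
    k = len(client_updates)
    # Eq. (5) (assumed form): average the local models and insight matrices.
    new_model = [sum(u["model"][i] for u in client_updates) / k
                 for i in range(len(global_model))]
    mean_insight = [sum(u["insight"][i] for u in client_updates) / k
                    for i in range(len(global_insight))]
    # Eq. (6) (assumed form): I^{t+1} = m * I^t + (1 - m) * mean client insight.
    new_insight = [m * gi + (1 - m) * mi
                   for gi, mi in zip(global_insight, mean_insight)]
    return new_model, new_insight
```

The momentum form is consistent with the paper's finding that static momentum (m = 0 or m = 1) underperforms dynamic updates, since m = 0 would discard history and m = 1 would freeze the global insight matrix.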
Open Source Code No The paper does not contain any explicit statements about open-sourcing the code or provide a link to a code repository.
Open Datasets Yes Datasets. To evaluate our approach, we conducted experiments on four datasets. Rotated MNIST [Ghifary et al., 2015] is built from MNIST by rotating 7,000 samples at angles of 0°, 15°, 30°, 45°, 60°, and 75°, resulting in six different domains. PACS [Li et al., 2017] has 9,991 images of seven object categories in four domains (photo, art, cartoon, and sketch). VLCS [Fang et al., 2013] has 10,729 images of five object categories in four domains (Caltech101, LabelMe, SUN09, and VOC2007). Office-Home [Venkateswara et al., 2017] is an image recognition dataset that includes 15,588 images of 65 classes from four different domains (art, clipart, product, and real-world). These datasets are commonly used in the domain generalization literature.
Dataset Splits Yes For all datasets, we perform the leave-one-domain-out strategy [Gulrajani and Lopez-Paz, 2020], where we choose one domain as the test domain, train the model on all remaining domains, and evaluate it on the chosen domain. Each source domain is treated as a client. Following standard practice, we use 90% of the available data for training and 10% for validation.
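The split protocol above can be sketched as a small helper; `leave_one_domain_out` is a hypothetical function illustrating the hold-one-domain-out test split combined with the 90/10 train/validation split of each source domain.

```python
import random

def leave_one_domain_out(domains, test_domain, val_ratio=0.1, seed=0):
    """Leave-one-domain-out split: hold out one domain for testing and split
    each remaining source domain 90/10 into train/validation.

    domains: dict mapping domain name -> list of samples.
    Hypothetical helper for illustration, not the paper's exact code."""
    rng = random.Random(seed)
    test = domains[test_domain]
    train, val = {}, {}
    for name, samples in domains.items():
        if name == test_domain:
            continue  # the test domain is never seen during training
        s = list(samples)
        rng.shuffle(s)
        n_val = int(len(s) * val_ratio)
        val[name], train[name] = s[:n_val], s[n_val:]
    return train, val, test
```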
Hardware Specification No The paper mentions model architectures like "ConvNet", "ResNet-18", and "ResNet-50", and training parameters like "batch size", "communication rounds", "local iterations per round", and "SGD optimizer", but does not specify any hardware details (e.g., GPU/CPU models, memory).
Software Dependencies No The paper does not provide specific software dependencies with version numbers.
Experiment Setup Yes For all datasets and scenarios, we set the number of communication rounds T to 100, with local iterations per round E = 1 to accommodate limited local computational resources. Local models are updated using the SGD optimizer with a momentum of 0.9. For the baselines, the best parameters reported in the original papers were used, and the optimal hyperparameters of FedDIM were found by grid search. Each experiment was repeated 3 times and the average was reported. We investigated the effects of the momentum coefficient and the loss trade-off parameter in the one-domain-multi-client scenario. We evaluated the sensitivity of the model on four datasets over λ ∈ {0.0001, 0.001, 0.01, 0.1, 1} and m ∈ {0.1, 0.3, 0.5, 0.7, 0.9}, as shown in Figure 6. The results show that FedDIM performs consistently for λ ∈ {0.0001, 0.001, 0.01, 0.1}, but performance degrades significantly at λ = 1. Furthermore, models with static momentum (m = 0 or m = 1) perform worse than those with dynamic momentum updates, highlighting the importance of momentum in improving generalization.
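The hyperparameter search over the (λ, m) grid quoted above can be sketched as an exhaustive sweep; `train_and_eval` is a hypothetical user-supplied function returning validation accuracy for one configuration.

```python
from itertools import product

# The (lambda, m) grid reported in the paper's sensitivity study.
LAMBDAS = [0.0001, 0.001, 0.01, 0.1, 1]
MOMENTA = [0.1, 0.3, 0.5, 0.7, 0.9]

def grid_search(train_and_eval):
    """Exhaustively evaluate every (lambda, m) pair and return the best one.

    train_and_eval(lam, m) -> validation accuracy; hypothetical callback
    standing in for a full train-then-validate run."""
    return max(product(LAMBDAS, MOMENTA),
               key=lambda cfg: train_and_eval(*cfg))
```

With 5 x 5 = 25 configurations and each experiment repeated 3 times, a full sweep costs 75 training runs per dataset, which is why such grids are typically kept coarse.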