Structure-informed Risk Minimization for Robust Ensemble Learning
Authors: Fengchun Qiao, Yanlin Chen, Xi Peng
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate SRM on two common OoD generalization benchmarks, DomainBed (Gulrajani & Lopez-Paz, 2020) and WILDS (Koh et al., 2021). Following standard practice, we use a held-out validation set from training distributions on the DomainBed benchmark and validation distributions on the WILDS benchmark for model selection. We provide implementation details and additional results in the Appendix. We provide the source code in the supplementary material. Baselines. We compare SRM with the following methods: (1) Uniform Ensemble; (2) Greedy Selection; (3) Empirical Risk Minimization (ERM) (Vapnik & Vapnik, 1998); (4) Uniform Prior; (5) Laplacian Prior; (6) Group Distributionally Robust Optimization (DRO) (Sagawa et al., 2019). These methods can be grouped into two categories: (1) Non-optimization-based, where the ensemble weight is obtained without optimization (Uniform Ensemble and Greedy Selection); (2) Optimization-based, where the ensemble weight is learned through an optimization process (ERM, Uniform Prior, Laplacian Prior, and DRO). 4.1. DomainBed Benchmark. Datasets. We conduct experiments on five datasets: Terra Incognita (Beery et al., 2018), VLCS (Fang et al., 2013), Office-Home (Venkateswara et al., 2017), PACS (Li et al., 2017), and DomainNet (Peng et al., 2019). |
| Researcher Affiliation | Academia | Deep REAL Lab, Department of Computer and Information Sciences, University of Delaware, DE, USA. Correspondence to: Xi Peng <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 Structure-informed Risk Minimization (SRM)<br>Input: Data of E_train, step sizes η_w and η_q<br>Output: Learned ensemble weights w<br>// Construct graph G and compute prior p<br>for i, j ∈ {1, ..., n} do<br>&nbsp;&nbsp;D(P_i, P_j) ← ‖μ_i − μ_j‖²₂ + ‖Σ_i^{1/2} − Σ_j^{1/2}‖²_F<br>&nbsp;&nbsp;A_ij ← D(P_i, P_j)<br>end<br>c(P_e) ← [Σ_{j=1}^{n} d(P_e, P_j)]⁻¹ // Closeness centrality<br>p_e ← c(P_e) / Σ_{j=1}^{n} c(P_j) // Prior distribution<br>// Optimize weights<br>Initialize w_0 ← (1/n)·1<br>while not converged do<br>&nbsp;&nbsp;Calculate L(w, q) via Eq. 9<br>&nbsp;&nbsp;Update ensemble weights w_{t+1} via Eq. 10<br>&nbsp;&nbsp;Update mixture weights q_{t+1} via Eq. 11<br>end |
| Open Source Code | Yes | Code is available at: https://github.com/deep-real/SRM. |
| Open Datasets | Yes | We evaluate SRM on two common OoD generalization benchmarks, DomainBed (Gulrajani & Lopez-Paz, 2020) and WILDS (Koh et al., 2021). ... We conduct experiments on five datasets: Terra Incognita (Beery et al., 2018), VLCS (Fang et al., 2013), Office-Home (Venkateswara et al., 2017), PACS (Li et al., 2017), and DomainNet (Peng et al., 2019). ... We evaluate SRM on the FMoW-WILDS (Koh et al., 2021) dataset, which comprises satellite images collected from different geographical regions across five continents at different times. |
| Dataset Splits | Yes | Following standard practice, we use a held-out validation set from training distributions on the DomainBed benchmark and validation distributions on the WILDS benchmark for model selection. ... For each dataset, we hold one distribution out for test and train on the remaining ones, and report the average accuracies over all test distributions. ... Apart from the original train-test split scheme (Test After 2016), where training distributions consist of years 2002 to 2013, test distributions consist of years 2016 and 2017, and years 2013 to 2016 are reserved for validation, we further propose two train-test split schemes which cover more diverse distribution shift scenarios: (1) Test Before 2004, where years 2007 to 2018 are for training, 2002 to 2004 are for testing, and 2004 to 2007 are for validation; (2) Test Middle, where years 2002 to 2008 and years 2012 to 2018 are for training, 2009 to 2011 are for testing, and years 2008 and 2011 are for validation. |
| Hardware Specification | No | The paper does not explicitly provide details about the specific hardware used for running its experiments, such as GPU models, CPU types, or memory specifications. |
| Software Dependencies | No | We use DiWA (Rame et al., 2022) to train the models in the ensemble pool. Each model in the ensemble pool is a ResNet50 (He et al., 2016) model trained with ERM (Vapnik & Vapnik, 1998) using different hyper-parameter settings. ... For optimizing w and q, we use the SGD optimizer. The paper mentions software tools and frameworks but does not provide specific version numbers for them. |
| Experiment Setup | Yes | The number of models (n) used in the experiments is 10. A random model in the ensemble pool is chosen to construct the distribution graph. For optimizing w and q, we use the SGD optimizer. For the experiments on DomainBed, we set ηw = 0.1 and ηq = 0.1, and for WILDS, we set ηw = 3e-2 and ηq = 0.1. λ is selected from [0.0, 2.0] for each dataset. We use the in-distribution validation set to optimize w and q, and the number of optimization steps is 100 and 50 for DomainBed and WILDS, respectively. |
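The graph-construction and prior steps of Algorithm 1 can be sketched in NumPy. This is a minimal illustration, not the paper's released code: it assumes the pairwise distances D(P_i, P_j) are computed from per-distribution Gaussian feature statistics (μ_e, Σ_e) extracted by one pool model, following the distance formula quoted in the Pseudocode row; all function names here are ours.

```python
import numpy as np

def psd_sqrt(cov):
    """Square root of a symmetric PSD matrix via eigendecomposition."""
    w, v = np.linalg.eigh(cov)
    return (v * np.sqrt(np.clip(w, 0.0, None))) @ v.T

def pairwise_distance(mu_i, cov_i, mu_j, cov_j):
    """Distance from Algorithm 1:
    D(P_i, P_j) = ||mu_i - mu_j||_2^2 + ||cov_i^{1/2} - cov_j^{1/2}||_F^2."""
    return float(np.sum((mu_i - mu_j) ** 2)
                 + np.sum((psd_sqrt(cov_i) - psd_sqrt(cov_j)) ** 2))

def structure_prior(mus, covs):
    """Build the distance matrix A over the n training distributions,
    then return the closeness-centrality prior p_e = c(P_e) / sum_j c(P_j)."""
    n = len(mus)
    A = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            A[i, j] = pairwise_distance(mus[i], covs[i], mus[j], covs[j])
    c = 1.0 / A.sum(axis=1)  # closeness centrality (diagonal of A is zero)
    return c / c.sum()       # normalize into a prior distribution
```

Note that the prior weights distributions by how central they are in the graph: a training distribution close to all others receives higher prior mass.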
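The weight-optimization loop of Algorithm 1 refers to Eqs. 9–11, which are not reproduced in this report, so the following is only a plausible sketch: it assumes a KL-regularized group-DRO objective L(w, q) = Σ_e q_e R_e(w) − λ·KL(q‖p), alternating projected gradient descent on the ensemble weights w with mirror ascent on the mixture weights q, with the structural prior p entering through the KL term. The objective form, update rules, and all function names are our assumptions, not the paper's.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / (np.arange(len(v)) + 1) > 0)[0][-1]
    theta = (1.0 - css[rho]) / (rho + 1)
    return np.clip(v + theta, 0.0, None)

def srm_optimize(risk, risk_grad_w, p, n_models,
                 eta_w=0.1, eta_q=0.1, lam=1.0, steps=100):
    """Hypothetical min-max loop: descend on w against the q-weighted risk,
    ascend on q toward high-risk mixtures while the KL term pulls q to p.
    risk(w) -> per-environment risks, shape (n_env,);
    risk_grad_w(w) -> Jacobian dR_e/dw, shape (n_env, n_models)."""
    w = np.full(n_models, 1.0 / n_models)  # w_0 = (1/n) 1, as in Algorithm 1
    q = p.copy()
    for _ in range(steps):
        R = risk(w)
        J = risk_grad_w(w)
        # Gradient step on w, projected back to the simplex.
        w = project_simplex(w - eta_w * (J.T @ q))
        # Mirror-ascent step on q for sum_e q_e R_e - lam * KL(q || p).
        g = R - lam * np.log(q / p)
        q = q * np.exp(eta_q * g)
        q /= q.sum()
    return w, q
```

With λ large, q stays near the structural prior p; with λ = 0 the update reduces to standard multiplicative-weights group DRO, which matches how the report positions SRM relative to the Uniform Prior and DRO baselines.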