Clustering-Based Validation Splits for Model Selection under Domain Shift

Authors: Andrea Napoli, Paul White

TMLR 2025

Reproducibility assessment (variable, result, and supporting excerpt from the LLM response):
Research Type: Experimental. Evidence: "In experiments, the technique consistently outperforms alternative splitting strategies across a range of datasets and training algorithms, for both domain generalisation and unsupervised domain adaptation tasks. Analysis also shows the MMD between the training and validation sets to be well-correlated with test domain accuracy, further substantiating the validity of this approach." (Section 5, Experiments)
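The quoted analysis correlates the MMD between training and validation sets with test-domain accuracy. As a point of reference, a biased estimator of the squared MMD under a Gaussian kernel (the paper's Table 6 reports a bandwidth of 1) can be sketched as follows; the function names are illustrative and this is not the paper's code:

```python
import numpy as np

def gaussian_kernel(X, Y, bandwidth=1.0):
    """Pairwise Gaussian (RBF) kernel matrix: k(x, y) = exp(-||x - y||^2 / (2 * bandwidth^2))."""
    sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dists / (2 * bandwidth ** 2))

def mmd_squared(X, Y, bandwidth=1.0):
    """Biased estimate of the squared maximum mean discrepancy between samples X and Y."""
    k_xx = gaussian_kernel(X, X, bandwidth).mean()
    k_yy = gaussian_kernel(Y, Y, bandwidth).mean()
    k_xy = gaussian_kernel(X, Y, bandwidth).mean()
    return k_xx + k_yy - 2 * k_xy
```

A split whose validation set drifts far from the training set in distribution would show up as a large value of this statistic.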
Researcher Affiliation: Academia. Evidence: "Andrea Napoli & Paul White, EMAIL, Institute of Sound and Vibration Research, University of Southampton, UK"
Pseudocode: Yes. Evidence: "Algorithm 1: Constrained kernel k-means clustering"
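Algorithm 1 is a constrained kernel k-means procedure. The unconstrained core of kernel k-means on a precomputed kernel matrix can be sketched as below; the paper's size constraints (solved as LPs with Gurobi, per the software-dependencies row) are omitted here, and all names are illustrative:

```python
import numpy as np

def kernel_kmeans(K, n_clusters, n_iters=50, init_labels=None, seed=0):
    """Plain (unconstrained) kernel k-means on a precomputed kernel matrix K.

    Hypothetical sketch only: the paper's Algorithm 1 additionally enforces
    constraints on the clusters, which are not implemented here.
    """
    n = K.shape[0]
    if init_labels is None:
        labels = np.random.default_rng(seed).integers(n_clusters, size=n)
    else:
        labels = np.asarray(init_labels).copy()
    diag = np.diag(K)
    for _ in range(n_iters):
        dists = np.full((n, n_clusters), np.inf)
        for c in range(n_clusters):
            mask = labels == c
            size = int(mask.sum())
            if size == 0:
                continue  # empty cluster: leave its distances at infinity
            # Distance in feature space to the cluster mean:
            # ||phi(x_i) - mu_c||^2 = K_ii - (2/|c|) sum_{j in c} K_ij
            #                        + (1/|c|^2) sum_{j,l in c} K_jl
            dists[:, c] = (diag
                           - 2.0 * K[:, mask].sum(axis=1) / size
                           + K[np.ix_(mask, mask)].sum() / size ** 2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # assignments are stable; converged
        labels = new_labels
    return labels
```

The update never needs explicit feature vectors: all distances are expressed through kernel evaluations, which is what makes the clustering compatible with the Gaussian kernel used for the MMD.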
Open Source Code: No. The paper neither links to source code for the described methodology nor states its release or availability in supplementary materials. The OpenReview link provided points to a review forum, not a code repository.
Open Datasets: Yes. Evidence: "Camelyon17-WILDS (Bándi et al., 2019; Koh et al., 2021): tumour detection in tissue samples across 5 hospitals... License: CC0. ... SVIRO (Dias Da Cruz et al., 2020): classification of vehicle rear-seat occupancy... License: CC BY-NC-SA 4.0. ... Terra Incognita (Beery et al., 2018): classification of wild animals... License: CDLA-Permissive 1.0."
Dataset Splits: Yes. Evidence: "S must be partitioned into training and validation sets, T and V respectively. ... T and V should be of sizes determined by a user-defined holdout fraction h satisfying 0 < h < 1." Table 6 reports a holdout fraction of 0.2 and a UDA holdout fraction of 0.5. "Every domain is tested 3 times for reproducibility, each time with a different random seed for model initialisation, hyperparameter search and other stochastic variables. The reported accuracy values are averages over all domains and repeats."
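The splitting rule above fixes the sizes of T and V through a user-defined holdout fraction h. A minimal sketch of turning cluster labels into such a split, under the simplifying assumption that whole clusters are moved into the validation set until it holds roughly a fraction h of the samples (the paper's constrained clustering controls the sizes more precisely), might look like:

```python
import numpy as np

def split_by_clusters(labels, holdout_fraction=0.2, seed=0):
    """Hypothetical sketch: build train/validation index sets from cluster
    labels, assigning whole clusters to validation until it reaches about
    a `holdout_fraction` share of the samples (0 < h < 1)."""
    labels = np.asarray(labels)
    n = len(labels)
    target = holdout_fraction * n
    clusters = np.random.default_rng(seed).permutation(np.unique(labels))
    val_idx = []
    for c in clusters:
        if len(val_idx) >= target:
            break  # validation set has reached the requested share
        val_idx.extend(np.flatnonzero(labels == c).tolist())
    val = np.array(sorted(val_idx))
    train = np.setdiff1d(np.arange(n), val)
    return train, val
```

Moving whole clusters, rather than random samples, is what lets the validation set mimic a distribution shift away from the training set.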
Hardware Specification: No. Evidence: "The authors acknowledge the use of the IRIDIS High Performance Computing Facility, and associated support services at the University of Southampton, in the completion of this work. ... In total, the experiments involve training 5,160 models, requiring around 100 GPU-days of computation." These statements indicate the use of computing resources but lack specific hardware details such as GPU/CPU models.
Software Dependencies: Yes. Evidence: "Experiments are conducted using the DomainBed framework (Gulrajani & Lopez-Paz, 2021). This means all but one of the domains are placed in the development set... The Gurobi Optimizer (Gurobi Optimization, LLC, 2023) is used to solve the LPs."
Experiment Setup: Yes. Evidence (Table 6, general parameter values and training details):
Hyperparameter random search size: 10
Number of trials: 3
Holdout fraction: 0.2
UDA holdout fraction: 0.5
Number of training steps: 3000
Gaussian kernel bandwidth: 1
Finetuning iterations before split: 3000
Nyström subset size (if applicable): 2000
Architecture: ResNet-18
Class balanced: True
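For reuse in a re-implementation, the Table 6 values could be collected into a configuration mapping like the following; the key names are hypothetical, but every value is taken from the paper's reported setup:

```python
# Hypothetical config mirroring Table 6 of the paper; key names are
# illustrative, values are the reported experimental parameters.
EXPERIMENT_CONFIG = {
    "hparam_random_search_size": 10,
    "n_trials": 3,
    "holdout_fraction": 0.2,
    "uda_holdout_fraction": 0.5,
    "n_training_steps": 3000,
    "gaussian_kernel_bandwidth": 1.0,
    "finetune_iters_before_split": 3000,
    "nystrom_subset_size": 2000,  # if applicable
    "architecture": "ResNet-18",
    "class_balanced": True,
}
```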