Finding Competence Regions in Domain Generalization
Authors: Jens Müller, Stefan T. Radev, Robert Schmier, Felix Draxler, Carsten Rother, Ullrich Koethe
TMLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present a comprehensive experimental evaluation of existing proxy scores as incompetence scores for classification and highlight the resulting trade-offs between rejection rate and accuracy gain. For comparability with prior work, we focus on standard DG benchmarks and consider the effect of measuring incompetence via different learned representations in a closed versus an open world setting. Our results suggest that increasing incompetence scores are indeed predictive of reduced accuracy, leading to significant improvements of the average accuracy below a suitable incompetence threshold. |
| Researcher Affiliation | Collaboration | Jens Müller (Informatics for Life, Heidelberg University, Germany); Stefan T. Radev (STRUCTURES Cluster of Excellence, Heidelberg University, Germany); Robert Schmier (Bosch Center for Artificial Intelligence, Renningen, Germany, and Heidelberg University, Germany); Felix Draxler (Heidelberg University, Germany); Carsten Rother (Heidelberg University, Germany); Ullrich Köthe (Heidelberg University, Germany) |
| Pseudocode | No | The paper describes methods and algorithms but does not provide any structured pseudocode or algorithm blocks. The steps are described in narrative text. |
| Open Source Code | Yes | We provide access to our code under https://github.com/XarwinM/competence_estimation |
| Open Datasets | Yes | These models are trained on six domain generalization data sets from the DomainBed repository (Gulrajani & Lopez-Paz, 2020): PACS (Li et al., 2017), OfficeHome (Venkateswara et al., 2017), VLCS (Fang et al., 2013), Terra Incognita (Beery et al., 2018), DomainNet (Peng et al., 2019) and SVIRO (Cruz et al., 2020). |
| Dataset Splits | Yes | We train a classifier on all but one domain. The one left out during training is then the OOD test domain where the competence region is evaluated. As an example, consider the DG task behind the earlier example in Figure 2: if we train a model on the domains Photos, Art images, and Sketches, the DG task asks for an accurate model on the domain Cartoons, which constitutes the OOD test domain (see Figure 2). Overall we consider 32 DG tasks, which result in 288 trained networks. We then compute the incompetence scores of each trained network. In Section 3.1, we describe the process of calculating the incompetence scores. For each DG task, we distinguish four data sets. For the ID distribution, we consider a training set, a validation set for hyperparameter optimization, and a test set that has no influence on the optimization process for the subsequent evaluation. |
| Hardware Specification | No | The acknowledgments mention: "We thank the Center for Information Services and High Performance Computing (ZIH) at TU Dresden for its facilities for high throughput calculations." This provides a general computing environment but lacks specific hardware details like GPU/CPU models, memory, or other detailed specifications. |
| Software Dependencies | No | We use all the standard settings provided in the DomainBed repository. We train three different neural network architectures with Empirical Risk Minimization, shortly ERM (Vapnik, 1999): namely, a ResNet-based architecture (He et al., 2016), a Vision Transformer (Dosovitskiy et al., 2020) and a Swin Transformer (Liu et al., 2021). If we just refer to ERM, we mean the ResNet-based architecture. Furthermore, we train classifiers with various recent DG algorithms, namely Fish (Shi et al., 2021), GroupDRO (Sagawa et al., 2019), SD (Pezeshki et al., 2021), SagNet (Nam et al., 2021), Mixup (Yan et al., 2020) and VREx (Krueger et al., 2021). While software tools and architectures are mentioned, specific version numbers for libraries or frameworks are not provided in the paper's text. |
| Experiment Setup | Yes | Training details and hyperparameter settings are listed in Appendix A.5. ... We use all the standard settings provided in the DomainBed repository and train all classifiers with hyperparameters proposed in the repository. The Vision Transformer and Swin Transformer are trained with hyperparameters found useful on these data sets and architectures as in (Wenzel et al., 2022). Each model is trained for 100 epochs on the smaller data sets (PACS, VLCS, Terra Incognita and OfficeHome) and for 10 epochs on DomainNet and SVIRO. When no improvement in terms of accuracy on the validation set is achieved, we stop the training. The best model is chosen based on the accuracy on the ID distribution, measured via the accuracy on the validation set. |
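The trade-off reported in the Research Type row (rejection rate vs. accuracy gain below an incompetence threshold) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the synthetic `scores` and `correct` arrays and the `accuracy_at_rejection` helper are hypothetical stand-ins for per-sample incompetence scores and prediction correctness on an OOD test domain.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: per-sample incompetence scores (higher = less
# trustworthy) and 0/1 correctness of the classifier's predictions.
scores = rng.normal(size=1000)
correct = (rng.random(1000) > 0.3).astype(int)

def accuracy_at_rejection(scores, correct, reject_fraction):
    """Reject the `reject_fraction` most incompetent samples and
    return accuracy on the remaining (accepted) samples."""
    threshold = np.quantile(scores, 1.0 - reject_fraction)
    accepted = scores <= threshold
    return correct[accepted].mean()

for r in (0.0, 0.1, 0.25):
    print(f"rejection rate {r:.0%}: accuracy {accuracy_at_rejection(scores, correct, r):.3f}")
```

With real incompetence scores that correlate with errors, accuracy among the accepted samples would rise as the rejection rate grows; with the random scores above it stays flat, which is exactly the behavior the paper's evaluation is designed to distinguish.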