Are Domain Generalization Benchmarks with Accuracy on the Line Misspecified?

Authors: Olawale Elijah Salaudeen, Nicole Chiou, Shiny Weng, Sanmi Koyejo

TMLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental We evaluate the correlation between in-domain (ID) and out-of-domain (OOD) accuracy for benchmarks in the popular DomainBed (Gulrajani & Lopez-Paz, 2020) and WILDS (Koh et al., 2021) benchmark suites, as well as subpopulation-shift benchmarks, e.g., Waterbirds (Sagawa et al., 2019). Table 1 highlights the prevalence of widely used domain generalization benchmarks with accuracy on the line, a signature of potential misspecification, while Figure 4 qualitatively illustrates benchmarks with weak or strongly negative correlation between in- and out-of-distribution accuracy.
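The "accuracy on the line" signature described above amounts to a strong linear correlation between ID and OOD accuracy across a pool of trained models. A minimal sketch of that check, using hypothetical accuracy values (the numbers below are illustrative, not the paper's results):

```python
import numpy as np

# Hypothetical (ID, OOD) accuracy pairs for a pool of trained models.
id_acc = np.array([0.85, 0.88, 0.90, 0.92, 0.95])
ood_acc = np.array([0.60, 0.64, 0.67, 0.70, 0.74])

# Pearson correlation between ID and OOD accuracy; a value near 1
# is the "accuracy on the line" signature.
r = np.corrcoef(id_acc, ood_acc)[0, 1]
print(f"ID/OOD accuracy correlation: r = {r:.3f}")
```

With real benchmark sweeps, each point would be one trained model's (ID, OOD) accuracy pair rather than a hand-written array.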
Researcher Affiliation Academia Olawale Salaudeen (EMAIL), Massachusetts Institute of Technology; Nicole Chiou (EMAIL), Stanford University; Shiny Weng (EMAIL), Stanford University; Sanmi Koyejo (EMAIL), Stanford University
Pseudocode Yes Algorithm 1: Generative Mechanism for Colored MNIST
Open Source Code Yes Code (footnote: https://github.com/olawalesalaudeen/misspecified_DG_benchmarks). We also provide a visualization tool to study future datasets: Link.
Open Datasets Yes Specifically, our results include the following datasets: Camelyon (Bandi et al., 2018; Koh et al., 2021), CivilComments (Borkan et al., 2019; Koh et al., 2021), Colored MNIST (Arjovsky et al., 2019; Gulrajani & Lopez-Paz, 2020), Covid-CXR (Alzate-Grisales et al., 2022; Cohen et al., 2020b; Tabik et al., 2020; Tahir et al., 2021; Suwalska et al., 2023), FMoW (Christie et al., 2018; Koh et al., 2021), PACS (Li et al., 2017; Gulrajani & Lopez-Paz, 2020), Spawrious (Lynch et al., 2023), Terra Incognita (Beery et al., 2018; Gulrajani & Lopez-Paz, 2020), and Waterbirds (Sagawa et al., 2019).
Dataset Splits Yes Our experiments involve ID/OOD splits using a leave-one-domain-out approach. Specifically, for each domain index i ∈ {1, ..., E}, we train on the subset E_train^i = {e_1, ..., e_{i-1}, e_{i+1}, ..., e_E} and test on the held-out domain E_test^i = {e_i}.
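The leave-one-domain-out protocol above can be sketched in a few lines; the domain labels e1..e4 below are hypothetical placeholders, not the paper's actual domains:

```python
def leave_one_domain_out_splits(domains):
    """Yield (train_domains, test_domain) pairs: for each held-out
    domain e_i, train on all remaining domains."""
    for i, test_domain in enumerate(domains):
        train_domains = domains[:i] + domains[i + 1:]
        yield train_domains, test_domain

# Example with four placeholder domains:
for train, test in leave_one_domain_out_splits(["e1", "e2", "e3", "e4"]):
    print(train, "->", test)
```

Each pair corresponds to one ID/OOD evaluation: the training domains supply the in-domain accuracy and the held-out domain supplies the out-of-domain accuracy.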
Hardware Specification No For vision datasets, we leverage pretrained deep learning architectures such as ResNet-18/50 (He et al., 2016), DenseNet-121 (Huang et al., 2017), Vision Transformers (Dosovitskiy et al., 2020), and ConvNeXt-Tiny (Liu et al., 2022). For language datasets, we utilize pretrained embeddings from BERT (Devlin et al., 2019) and DistilBERT (Sanh et al., 2020), and apply lower-capacity machine learning classifiers, such as logistic regression, for downstream classification tasks. The paper does not mention any specific hardware used for running the experiments.
Software Dependencies No For vision datasets, we leverage pretrained deep learning architectures such as ResNet-18/50 (He et al., 2016), DenseNet-121 (Huang et al., 2017), Vision Transformers (Dosovitskiy et al., 2020), and ConvNeXt-Tiny (Liu et al., 2022). For language datasets, we utilize pretrained embeddings from BERT (Devlin et al., 2019) and DistilBERT (Sanh et al., 2020), and apply lower-capacity machine learning classifiers, such as logistic regression, for downstream classification tasks. The paper lists model architectures but does not specify any software libraries (e.g., PyTorch, TensorFlow) or their version numbers.
Experiment Setup Yes Table 2: Model-generation hyperparameter ranges. Learning rate (lr): 10^-5 to 10^-3.5; Weight decay: 10^-6 to 10^-2; Batch size: 2^3 (8) to 2^5.5 (~45); Data augmentation: {True, False}; Transfer learning: {True, False}; Model architecture: {ResNet18, ResNet50, DenseNet121, ViT-B-16, ConvNeXt_Tiny}; Dropout: {0.0, 0.1, 0.5}; Epoch.
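The Table 2 search space can be drawn from with a simple random-search sampler. The sketch below assumes log-uniform sampling for the continuous ranges and a uniform choice over the discrete sets; the function name and distribution choices are ours, not the paper's code (the epoch range is omitted because the table does not state it):

```python
import random

def sample_hparams(rng):
    """Draw one hyperparameter configuration from the Table 2 ranges
    (assumed log-uniform for continuous values, uniform for discrete)."""
    return {
        "lr": 10 ** rng.uniform(-5, -3.5),                    # 10^-5 to 10^-3.5
        "weight_decay": 10 ** rng.uniform(-6, -2),            # 10^-6 to 10^-2
        "batch_size": int(round(2 ** rng.uniform(3, 5.5))),   # 8 to ~45
        "data_augmentation": rng.choice([True, False]),
        "transfer_learning": rng.choice([True, False]),
        "architecture": rng.choice(
            ["ResNet18", "ResNet50", "DenseNet121", "ViT-B-16", "ConvNeXt_Tiny"]),
        "dropout": rng.choice([0.0, 0.1, 0.5]),
    }

hp = sample_hparams(random.Random(0))
print(hp)
```

Sampling many such configurations and training one model per draw is how a pool of models for the ID-vs-OOD accuracy scatter is typically generated.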