Are Domain Generalization Benchmarks with Accuracy on the Line Misspecified?

Authors: Olawale Elijah Salaudeen, Nicole Chiou, Shiny Weng, Sanmi Koyejo

TMLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental We evaluate the correlation between in-domain (ID) and out-of-domain (OOD) accuracy for benchmarks in the popular DomainBed (Gulrajani & Lopez-Paz, 2020) and WILDS (Koh et al., 2021) benchmark suites, as well as subpopulation-shift benchmarks, e.g., Waterbirds (Sagawa et al., 2019). Table 1 highlights the prevalence of widely used domain generalization benchmarks with accuracy on the line, a signature of potential misspecification, while Figure 4 qualitatively illustrates benchmarks with weak or strongly negative correlation between in- and out-of-distribution accuracy.
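The "accuracy on the line" signature described above amounts to a strong linear correlation between ID and OOD accuracy across a pool of trained models. A minimal sketch of that check, using hypothetical accuracy values (the numbers below are illustrative, not the paper's results):

```python
import numpy as np

# Hypothetical (ID, OOD) accuracy pairs for a pool of trained models.
id_acc = np.array([0.85, 0.88, 0.90, 0.92, 0.95])
ood_acc = np.array([0.60, 0.64, 0.67, 0.70, 0.74])

# Pearson correlation between ID and OOD accuracy; a value near 1
# is the "accuracy on the line" signature.
r = np.corrcoef(id_acc, ood_acc)[0, 1]
print(f"ID/OOD accuracy correlation: r = {r:.3f}")
```

With real benchmark sweeps, each point would be one trained model's (ID, OOD) accuracy pair rather than a hand-written array.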
Researcher Affiliation Academia Olawale Salaudeen (EMAIL), Massachusetts Institute of Technology; Nicole Chiou (EMAIL), Stanford University; Shiny Weng (EMAIL), Stanford University; Sanmi Koyejo (EMAIL), Stanford University
Pseudocode Yes Algorithm 1: Generative Mechanism for Colored MNIST
Open Source Code Yes Code (footnote: https://github.com/olawalesalaudeen/misspecified_DG_benchmarks). We also provide a visualization tool to study future datasets: Link.
Open Datasets Yes Specifically, our results include the following datasets: Camelyon (Bandi et al., 2018; Koh et al., 2021), CivilComments (Borkan et al., 2019; Koh et al., 2021), Colored MNIST (Arjovsky et al., 2019; Gulrajani & Lopez-Paz, 2020), Covid-CXR (Alzate-Grisales et al., 2022; Cohen et al., 2020b; Tabik et al., 2020; Tahir et al., 2021; Suwalska et al., 2023), FMoW (Christie et al., 2018; Koh et al., 2021), PACS (Li et al., 2017; Gulrajani & Lopez-Paz, 2020), Spawrious (Lynch et al., 2023), Terra Incognita (Beery et al., 2018; Gulrajani & Lopez-Paz, 2020), and Waterbirds (Sagawa et al., 2019).
Dataset Splits Yes Our experiments involve ID/OOD splits using a leave-one-domain-out approach. Specifically, for each domain index i ∈ {1, ..., E}, we train on the subset E_train^i = {e_1, ..., e_{i-1}, e_{i+1}, ..., e_E} and test on the held-out domain E_test^i = {e_i}.
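The leave-one-domain-out protocol above can be sketched in a few lines; the domain labels e1..e4 below are hypothetical placeholders, not the paper's actual domains:

```python
def leave_one_domain_out_splits(domains):
    """Yield (train_domains, test_domain) pairs: for each held-out
    domain e_i, train on all remaining domains."""
    for i, test_domain in enumerate(domains):
        train_domains = domains[:i] + domains[i + 1:]
        yield train_domains, test_domain

# Example with four placeholder domains:
for train, test in leave_one_domain_out_splits(["e1", "e2", "e3", "e4"]):
    print(train, "->", test)
```

Each pair corresponds to one ID/OOD evaluation: the training domains supply the in-domain accuracy and the held-out domain supplies the out-of-domain accuracy.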
Hardware Specification No For vision datasets, we leverage pretrained deep learning architectures such as ResNet-18/50 (He et al., 2016), DenseNet-121 (Huang et al., 2017), Vision Transformers (Dosovitskiy et al., 2020), and ConvNeXt-Tiny (Liu et al., 2022). For language datasets, we utilize pretrained embeddings from BERT (Devlin et al., 2019) and DistilBERT (Sanh et al., 2020), and apply lower-capacity machine learning classifiers, such as logistic regression, for downstream classification tasks. The paper does not mention any specific hardware used for running the experiments.
Software Dependencies No For vision datasets, we leverage pretrained deep learning architectures such as ResNet-18/50 (He et al., 2016), DenseNet-121 (Huang et al., 2017), Vision Transformers (Dosovitskiy et al., 2020), and ConvNeXt-Tiny (Liu et al., 2022). For language datasets, we utilize pretrained embeddings from BERT (Devlin et al., 2019) and DistilBERT (Sanh et al., 2020), and apply lower-capacity machine learning classifiers, such as logistic regression, for downstream classification tasks. The paper lists model architectures but does not specify any software libraries (e.g., PyTorch, TensorFlow) or their version numbers.
Experiment Setup Yes Table 2: Model-generation hyperparameter ranges. Learning rate (lr): 10^-5 to 10^-3.5; Weight decay: 10^-6 to 10^-2; Batch size: 2^3 (8) to 2^5.5 (~45); Data augmentation: {True, False}; Transfer learning: {True, False}; Model architecture: {ResNet18, ResNet50, DenseNet121, ViT-B-16, ConvNeXt_Tiny}; Dropout: {0.0, 0.1, 0.5}; Epoch.
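The Table 2 search space can be drawn from with a simple random-search sampler. The sketch below assumes log-uniform sampling for the continuous ranges and a uniform choice over the discrete sets; the function name and distribution choices are ours, not the paper's code (the epoch range is omitted because the table does not state it):

```python
import random

def sample_hparams(rng):
    """Draw one hyperparameter configuration from the Table 2 ranges
    (assumed log-uniform for continuous values, uniform for discrete)."""
    return {
        "lr": 10 ** rng.uniform(-5, -3.5),                    # 10^-5 to 10^-3.5
        "weight_decay": 10 ** rng.uniform(-6, -2),            # 10^-6 to 10^-2
        "batch_size": int(round(2 ** rng.uniform(3, 5.5))),   # 8 to ~45
        "data_augmentation": rng.choice([True, False]),
        "transfer_learning": rng.choice([True, False]),
        "architecture": rng.choice(
            ["ResNet18", "ResNet50", "DenseNet121", "ViT-B-16", "ConvNeXt_Tiny"]),
        "dropout": rng.choice([0.0, 0.1, 0.5]),
    }

hp = sample_hparams(random.Random(0))
print(hp)
```

Sampling many such configurations and training one model per draw is how a pool of models for the ID-vs-OOD accuracy scatter is typically generated.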