Are Domain Generalization Benchmarks with Accuracy on the Line Misspecified?
Authors: Olawale Elijah Salaudeen, Nicole Chiou, Shiny Weng, Sanmi Koyejo
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate the correlation between in-domain (ID) and out-of-domain (OOD) accuracy for benchmarks in the popular DomainBed (Gulrajani & Lopez-Paz, 2020) and WILDS (Koh et al., 2021) benchmark suites, as well as subpopulation shift benchmarks, e.g., Waterbirds (Sagawa et al., 2019). Table 1 highlights the prevalence of widely used domain generalization benchmarks with accuracy on the line, a signature of potential misspecification, while Figure 4 qualitatively illustrates benchmarks with weak or strongly negative correlation between in- and out-of-distribution accuracy. |
| Researcher Affiliation | Academia | Olawale Salaudeen EMAIL Massachusetts Institute of Technology Nicole Chiou EMAIL Stanford University Shiny Weng EMAIL Stanford University Sanmi Koyejo EMAIL Stanford University |
| Pseudocode | Yes | Algorithm 1: Generative Mechanism for Colored MNIST |
| Open Source Code | Yes | Code . We also provide a visualization tool to study future datasets: Link. (footnote: https://github.com/olawalesalaudeen/misspecified_DG_benchmarks.) |
| Open Datasets | Yes | Specifically, our results include the following datasets: Camelyon (Bandi et al., 2018; Koh et al., 2021), CivilComments (Borkan et al., 2019; Koh et al., 2021), Colored MNIST (Arjovsky et al., 2019; Gulrajani & Lopez-Paz, 2020), Covid-CXR (Alzate-Grisales et al., 2022; Cohen et al., 2020b; Tabik et al., 2020; Tahir et al., 2021; Suwalska et al., 2023), FMoW (Christie et al., 2018; Koh et al., 2021), PACS (Li et al., 2017; Gulrajani & Lopez-Paz, 2020), Spawrious (Lynch et al., 2023), Terra Incognita (Beery et al., 2018; Gulrajani & Lopez-Paz, 2020), and Waterbirds (Sagawa et al., 2019). |
| Dataset Splits | Yes | Our experiments involve ID/OOD splits using a leave-one-domain-out approach. Specifically, for each domain index i ∈ {1, …, E}, we train on the subset E_train^i = {e_1, …, e_{i−1}, e_{i+1}, …, e_E} and test on the held-out domain E_test^i = {e_i}. |
| Hardware Specification | No | For vision datasets, we leverage pretrained deep learning architectures such as ResNet-18/50 (He et al., 2016), DenseNet-121 (Huang et al., 2017), Vision Transformers (Dosovitskiy et al., 2020), and ConvNeXt-Tiny (Liu et al., 2022). For language datasets, we utilize pretrained embeddings from BERT (Devlin et al., 2019) and DistilBERT (Sanh et al., 2020), and apply lower-capacity machine learning classifiers, such as logistic regression, for downstream classification tasks. The paper does not mention any specific hardware used for running the experiments. |
| Software Dependencies | No | For vision datasets, we leverage pretrained deep learning architectures such as ResNet-18/50 (He et al., 2016), DenseNet-121 (Huang et al., 2017), Vision Transformers (Dosovitskiy et al., 2020), and ConvNeXt-Tiny (Liu et al., 2022). For language datasets, we utilize pretrained embeddings from BERT (Devlin et al., 2019) and DistilBERT (Sanh et al., 2020), and apply lower-capacity machine learning classifiers, such as logistic regression, for downstream classification tasks. The paper lists model architectures but does not specify any software libraries (e.g., PyTorch, TensorFlow) or their version numbers. |
| Experiment Setup | Yes | Table 2 (Model Generation Hyperparameters): Learning Rate (lr): 10^-5 to 10^-3.5; Weight Decay: 10^-6 to 10^-2; Batch Size: 2^3 (8) to 2^5.5 (≈45); Data Augmentation: {True, False}; Transfer Learning: {True, False}; Model Architecture: {ResNet-18, ResNet-50, DenseNet-121, ViT-B-16, ConvNeXt-Tiny}; Dropout: {0.0, 0.1, 0.5}; Epoch |
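The two procedures the table describes can be sketched in a few lines: the leave-one-domain-out split (train on E domains minus one, test on the held-out domain) and the ID/OOD accuracy correlation whose strength signals "accuracy on the line". This is a minimal illustrative sketch, not the authors' code; the function names, the example domain list (borrowed from PACS), and the use of Pearson correlation via `numpy.corrcoef` are assumptions.

```python
import numpy as np


def leave_one_domain_out_splits(domains):
    """For each domain e_i, yield (train_domains, test_domain):
    train on all other domains, test on the held-out e_i."""
    for i, test_domain in enumerate(domains):
        train_domains = domains[:i] + domains[i + 1:]
        yield train_domains, test_domain


def id_ood_correlation(id_acc, ood_acc):
    """Pearson correlation between per-model ID and OOD accuracies.
    A strong positive value corresponds to 'accuracy on the line'."""
    return float(np.corrcoef(np.asarray(id_acc), np.asarray(ood_acc))[0, 1])


# Hypothetical usage with PACS-style domain names:
domains = ["art", "cartoon", "photo", "sketch"]
splits = list(leave_one_domain_out_splits(domains))
# splits[0] == (["cartoon", "photo", "sketch"], "art")
```

Each of the E splits trains one set of models, so a benchmark with E domains yields E ID/OOD scatter plots on which the correlation is measured.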