Is Large-scale Pretraining the Secret to Good Domain Generalization?

Authors: Piotr Teterwak, Kuniaki Saito, Theodoros Tsiligkaridis, Bryan Plummer, Kate Saenko

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Our experiments confirm that the Alignment Hypothesis holds, and we use it as an analysis tool for existing DG methods evaluated on DomainBed datasets by splitting evaluation data into In-pretraining (IP) and Out-of-pretraining (OOP) subsets. We evaluate Domain Generalization (DG) performance on both pretraining-aligned (IP) and pretraining-misaligned (OOP) data.
Researcher Affiliation Academia 1Boston University 2MIT Lincoln Laboratory
Pseudocode Yes Algorithm 1: Evaluating the Image Similarity Hypothesis
Require: target domain samples D_target, trained DG model M, pre-trained image encoder f_I
1: for each sample I in D_target do
2:   Retrieve the nearest neighbor of I in LAION-400M using f_I features, assign to I_k
3:   Compute PerceptualSimilarityScore(I, I_k) using Eq. 1
4:   Record correctness of M(I)
5: end for
6: Bin samples based on Perceptual Similarity Score
7: Compute DG accuracy within each bin
8: return accuracy for each bin
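The binning procedure in Algorithm 1 can be sketched as follows. This is a hypothetical illustration, not the authors' implementation: the helpers `retrieve_nn` (nearest-neighbor lookup in LAION-400M feature space), `perceptual_similarity` (the paper's Eq. 1 score), and `model_correct` (whether the DG model M classifies the sample correctly) are assumed stand-ins supplied by the caller.

```python
import numpy as np

def evaluate_similarity_hypothesis(target_samples, retrieve_nn,
                                   perceptual_similarity, model_correct,
                                   n_bins=10):
    """Per-bin DG accuracy as a function of similarity to pretraining data."""
    scores, correct = [], []
    for img in target_samples:
        neighbor = retrieve_nn(img)                           # NN in LAION-400M
        scores.append(perceptual_similarity(img, neighbor))   # paper's Eq. 1
        correct.append(model_correct(img))                    # correctness of M(img)
    scores = np.asarray(scores)
    correct = np.asarray(correct, dtype=float)
    # Bin samples by similarity score, then average correctness within each bin.
    edges = np.linspace(scores.min(), scores.max(), n_bins + 1)
    bin_idx = np.clip(np.digitize(scores, edges) - 1, 0, n_bins - 1)
    return [correct[bin_idx == b].mean() if (bin_idx == b).any() else None
            for b in range(n_bins)]
```

With real data, a rising accuracy curve across bins would support the hypothesis that samples closer to the pretraining distribution are classified more reliably.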
Open Source Code No The paper mentions using a "slightly modified MIRO (Cha et al., 2022) codebase for training and evaluation" and "directly use the author's implementation and hyper-parameters" for other methods, but does not provide an explicit statement or link for the release of their own source code for the methodology described in this paper. While they release a dataset, they do not state that their specific code is open-source.
Open Datasets Yes We release DomainBed-OOP at https://huggingface.co/datasets/PTeterwak/DomainBed_OOP. We apply this approach to five widely-used DomainBed (Gulrajani & Lopez-Paz, 2020) DG datasets: VLCS (Fang et al., 2013), PACS (Li et al., 2017), OfficeHome (Ganin et al., 2016), TerraIncognita (Beery et al., 2018), and DomainNet (Peng et al., 2019).
Dataset Splits Yes We use leave-one-out evaluation, where a model is trained on all domains except the evaluation domain. After filtering, we focus on determining a threshold to split the dataset into In-Pretraining (IP) and Out-of-Pretraining (OOP) subsets. We select 0.21 as the threshold...
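The reported 0.21 threshold can be sketched as a simple partition. This is a minimal illustration, not the released splitting code; it assumes each sample already has a perceptual-similarity score to its nearest LAION-400M neighbor, and that a higher score means the sample is closer to the pretraining data (an assumption about the score's orientation).

```python
THRESHOLD = 0.21  # value reported in the paper

def split_ip_oop(samples, similarity_scores, threshold=THRESHOLD):
    """Partition samples into In-Pretraining (IP) and Out-of-Pretraining (OOP)
    subsets by thresholding their similarity to pretraining data.
    Assumes higher score = more similar to the pretraining distribution."""
    ip, oop = [], []
    for sample, score in zip(samples, similarity_scores):
        (ip if score >= threshold else oop).append(sample)
    return ip, oop
```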
Hardware Specification Yes Each run uses an A6000 48GB GPU, trained for up to 12 hours per domain-dataset combination.
Software Dependencies No The paper mentions using "OpenCLIP ViT-B/16" but does not specify a software version for it, nor does it list versions for other key software components such as the programming language or deep learning framework.
Experiment Setup Yes We use default hyper-parameters as defined by (Cha et al., 2022). This includes a learning rate of 5e-5, weight decay of 0.0, a batch size of 32 per domain, the Adam optimizer, and no dropout for all methods. For SWAD, we use an optimum patience of 3, an overfit patience of 6, and a tolerance ratio of 6. For MIRO, we use a regularizer loss weight of 1.0. For CORAL, we use a CORAL regularizer weight of 1.0, following (Cha et al., 2021). For LP-FT, we train the linear probe for 600 steps before unlocking the full backbone. For Model Parameter Averaging, we burn in the training for 600 steps before averaging iterates.
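The quoted hyper-parameters can be collected into a single configuration for reference. The dict below is a sketch that simply records the values stated above under assumed key names; it is not the authors' configuration file.

```python
# All values are taken verbatim from the quoted experiment setup;
# the key names themselves are assumptions for illustration.
DEFAULT_CONFIG = {
    "learning_rate": 5e-5,
    "weight_decay": 0.0,
    "batch_size_per_domain": 32,
    "optimizer": "Adam",
    "dropout": 0.0,
    # SWAD stopping criteria
    "swad": {"optimum_patience": 3, "overfit_patience": 6, "tolerance_ratio": 6},
    # Regularizer weights
    "miro_loss_weight": 1.0,
    "coral_loss_weight": 1.0,
    # Warm-up steps before full fine-tuning / iterate averaging
    "lp_ft_linear_probe_steps": 600,
    "parameter_averaging_burn_in_steps": 600,
}
```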