In Search of Forgotten Domain Generalization
Authors: Prasanna Mayilvahanan, Roland Zimmermann, Thaddäus Wiedemer, Evgenia Rusak, Attila Juhos, Matthias Bethge, Wieland Brendel
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Training CLIP models on these datasets reveals that a significant portion of their performance is explained by in-domain examples. This indicates that the OOD generalization challenges from the ImageNet era still prevail and that training on web-scale data merely creates the illusion of OOD generalization. Furthermore, through a systematic exploration of combining natural and rendition datasets in varying proportions, we identify optimal mixing ratios for model generalization across these domains. Our datasets and results re-enable meaningful assessment of OOD robustness at scale, a crucial prerequisite for improving model robustness. |
| Researcher Affiliation | Academia | University of Tübingen; Tübingen AI Center; Max-Planck-Institute for Intelligent Systems, Tübingen; ELLIS Institute Tübingen. Contact: EMAIL, EMAIL. |
| Pseudocode | No | The paper describes methods and processes in text and figures (e.g., 'Fig. 7: Labeling setup.'), but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks with structured steps in a code-like format. |
| Open Source Code | Yes | Code available at https://brendel-group.github.io/clip-dg/. |
| Open Datasets | Yes | To rigorously test whether CLIP's success in the rendition domain stems from exposure to renditions during training, we first train a domain classifier to distinguish natural images from renditions (Sec. 3.2). By applying the domain classifier to a deduplicated version of LAION-400M, we create and release two datasets: LAION-Natural contains 57 M natural images; LAION-Rendition consists of 16 M renditions of scenes and objects. |
| Dataset Splits | Yes | To this end, we train classifiers on 13,000 labeled LAION-200M images, retaining 3,000 samples each for a validation and test set. From the domain classification literature discussed in Sec. 2, we evaluate four methods with publicly available code. Overall, we label 19,000 random images from LAION-200M and 1,000 images from each of the ImageNet and DomainNet test sets (12,000 in total). We obtain LAION-Natural with roughly 57 million samples and LAION-Rendition with roughly 16 million samples. Fig. 3 shows random samples from both datasets; more samples are shown in Figs. 20 and 21. We also deploy the domain classifiers on the ImageNet and DomainNet test sets to remove the domain contamination reported above and create clean test sets. The exact number of datapoints and the number of classes for each test set are detailed in Tab. 12. |
| Hardware Specification | Yes | For all our experiments, we train CLIP ViT-B/32 (Dosovitskiy et al., 2020) from scratch for 32 epochs with a batch size of 16,384 on a single node with either four or eight A100 GPUs (training takes several days, depending on dataset size). |
| Software Dependencies | No | We use the implementation and hyperparameters provided by Ilharco et al. (2021); for training the CLIP models we used this publicly available code exclusively. While this points to a specific implementation, the paper does not provide version numbers for any software components (e.g., Python, PyTorch, CUDA, or the specific OpenCLIP release). |
| Experiment Setup | Yes | For all our experiments, we train CLIP ViT-B/32 (Dosovitskiy et al., 2020) from scratch for 32 epochs with a batch size of 16,384 on a single node with either four or eight A100 GPUs (training takes several days, depending on dataset size). We use the implementation and hyperparameters provided by Ilharco et al. (2021). For the FT (Finetuning) model, as mentioned in Sec. 3.2, we finetune a CLIP ViT-L/14 pretrained on LAION-2B with a linear readout. We finetune all models on 4 A100 GPUs, using a batch size of 256, a weight decay of 5e-4, an SGD optimizer with a step scheduler (learning-rate decay by a factor of 0.1 every 20 epochs), at an initial learning rate of 0.1, for 50 epochs. |
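The finetuning schedule quoted in the Experiment Setup row (initial learning rate 0.1, decay by a factor of 0.1 every 20 epochs, 50 epochs total) can be sketched as a plain step-decay function. This is a minimal illustration of the stated hyperparameters, not the authors' actual training code; the function name and structure are our own.

```python
def step_lr_schedule(base_lr=0.1, gamma=0.1, step_size=20, epochs=50):
    """Per-epoch learning rates for a step scheduler as described in the
    reported setup: start at base_lr and multiply by gamma every
    step_size epochs (equivalent to torch.optim.lr_scheduler.StepLR)."""
    return [base_lr * gamma ** (epoch // step_size) for epoch in range(epochs)]

schedule = step_lr_schedule()
# Epochs 0-19 run at 0.1, epochs 20-39 at 0.01, epochs 40-49 at 0.001.
```

In a PyTorch training loop this would correspond to `StepLR(optimizer, step_size=20, gamma=0.1)` wrapped around an `SGD` optimizer with `lr=0.1` and `weight_decay=5e-4`, per the quoted setup.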