When and How Does CLIP Enable Domain and Compositional Generalization?

Authors: Elias Kempf, Simon Schrodi, Max Argus, Thomas Brox

ICML 2025

Reproducibility Assessment (Variable: Result, with supporting LLM response excerpts)
Research Type: Experimental
"Our experiments show that domain diversity is essential for both domain and compositional generalization... To answer such questions, we construct fully controllable experimental conditions that allow precise and systematic manipulation... We complement these experiments with in-depth data-centric and mechanistic analyses... Figure 2 summarizes the results for CLIP models with ResNet-50 vision encoder and trained on ImageNet-Captions as base dataset."
Researcher Affiliation: Academia
"University of Freiburg. Correspondence to: Elias Kempf <EMAIL>, Simon Schrodi <EMAIL>."
Pseudocode: Yes
"Further details on this computation and pseudocode are provided in Appendix C.1." The appendix provides Algorithm 1 ("get topk features") and Algorithm 2 ("measure feature sharing").
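The algorithm names suggest a top-k feature-selection step followed by an overlap measure. A minimal sketch of what such routines could look like (NumPy; the ranking criterion and the overlap metric here are our own guesses, the actual procedures are defined in the paper's Appendix C.1):

```python
import numpy as np

def get_topk_features(activations: np.ndarray, k: int) -> np.ndarray:
    """Return indices of the k features with the highest mean activation.

    `activations` is a (num_samples, num_features) matrix. The real
    Algorithm 1 may rank features by a different statistic.
    """
    mean_act = activations.mean(axis=0)
    return np.argsort(mean_act)[::-1][:k]

def measure_feature_sharing(acts_a: np.ndarray, acts_b: np.ndarray, k: int) -> float:
    """Fraction of top-k features shared between two activation sets.

    A simple overlap ratio; the paper's Algorithm 2 may use a
    different sharing metric.
    """
    top_a = set(get_topk_features(acts_a, k).tolist())
    top_b = set(get_topk_features(acts_b, k).tolist())
    return len(top_a & top_b) / k
```

For example, comparing activations of the same model on two domains would quantify how many of the most-used features are reused across them.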
Open Source Code: Yes
"For reproducibility, our code is available at https://github.com/lmb-freiburg/understanding-clip-ood."
Open Datasets: Yes
"Specifically, we augmented a base dataset consisting primarily of natural images, such as ImageNet-Captions (Fang et al., 2022), with non-natural samples from DomainNet (Peng et al., 2019)... We used either ImageNet-Captions (Fang et al., 2022), CC3M (Sharma et al., 2018), or CC12M (Changpinyo et al., 2021) as our base image-text datasets D_0... For the domain-specific image-text pairs D_r (blue, green, orange in Figure 1), we used DomainNet's non-natural domains: Clipart, Infograph, Painting, Quickdraw, and Sketch (Peng et al., 2019). For this, we used ImageNet-Sketch (Wang et al., 2019)..."
Dataset Splits: Yes
"Our test data for all training data setups consists of the same novel combinations of the classes C_2 of the test domain D_i: D_i^{C_2}. By keeping the test set fixed throughout all conditions, we ensure comparability across these setups... Notation: Let D_0 denote the base domain, mostly consisting of natural images with image-text pairs (I_i^0, T_i^0) (red in Figure 1). Further, we consider m non-natural domains D_r for r ∈ {1, ..., m} with image-text pairs (I_i^r, T_i^r) (blue, green, orange). Lastly, we consider the object classes C = {c_1, ..., c_n} for the images I, which we divide into two disjoint subsets C_1 = {c_1, ..., c_k} (squares) and C_2 = {c_{k+1}, ..., c_n} (circles). We denote the subset of D_r that contains only classes from C as D_r^C. Finally, the class choices for C_1 and C_2 are provided in Appendix A.4." Appendix A.4: "Our final selection of classes is C_2 = {aircraft carrier, axe, banana, barn, bed, candle, lion, mountain, necklace, penguin, pizza, saxophone, television, tractor, traffic light}," and C_1 comprises the remaining 330 classes of DomainNet.
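The split notation can be made concrete with a short sketch (the hard-coded C_2 list is the one reported in Appendix A.4; the sample format and helper names are our own illustration, not the paper's code):

```python
# C_2: the 15 held-out classes used to form novel domain/class
# combinations at test time (from Appendix A.4 of the paper).
C2 = [
    "aircraft carrier", "axe", "banana", "barn", "bed", "candle",
    "lion", "mountain", "necklace", "penguin", "pizza", "saxophone",
    "television", "tractor", "traffic light",
]

def split_classes(all_classes):
    """Divide the class set C into disjoint subsets:
    C_1 (all remaining classes) and C_2 (held-out classes)."""
    c1 = sorted(set(all_classes) - set(C2))
    return c1, sorted(C2)

def domain_subset(samples, allowed_classes):
    """D_r^C: the subset of domain D_r whose class label lies in C.

    Each sample is assumed to be a dict with a "class" key (our
    convention for this sketch).
    """
    allowed = set(allowed_classes)
    return [s for s in samples if s["class"] in allowed]
```

Keeping the test set D_i^{C_2} fixed then amounts to always evaluating on `domain_subset(test_domain, C2)` regardless of the training condition.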
Hardware Specification: Yes
"We conducted our experiments mainly on NVIDIA RTX 2080 GPUs and estimated the total GPU hours to be approximately 25,000."
Software Dependencies: No
The paper mentions several software components, such as OpenCLIP (Cherti et al., 2023), LAVIS (Li et al., 2023), and nnsight (Fiotto-Kaufman et al., 2025), but none of these mentions includes explicit version numbers, which are critical for reproducibility.
Experiment Setup: Yes
"We trained the models for 32 epochs with a batch size of 1024 with AdamW (learning rate of 0.001, β1 = 0.9, β2 = 0.999, ε = 1e-8, weight decay of 0.2) and cosine annealing learning rate scheduling with 500 warmup steps."
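The reported schedule (base learning rate 1e-3, cosine annealing, 500 warmup steps) can be sketched as a step-to-learning-rate function. The linear warmup shape and the total step count below are assumptions; the real total depends on dataset size and batch size:

```python
import math

def lr_at_step(step, *, base_lr=1e-3, warmup_steps=500, total_steps=32 * 1200):
    """Learning rate at a given optimizer step: linear warmup for the
    first `warmup_steps`, then cosine annealing to zero.

    base_lr and warmup_steps match the paper's reported values;
    total_steps is a placeholder for illustration.
    """
    if step < warmup_steps:
        # Linear ramp from base_lr / warmup_steps up to base_lr.
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    # Half-cosine decay from base_lr down to 0.
    return 0.5 * base_lr * (1 + math.cos(math.pi * min(progress, 1.0)))
```

In a real training loop this would typically be handled by the training framework's built-in warmup-plus-cosine scheduler rather than a hand-rolled function.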