Does Data Scaling Lead to Visual Compositional Generalization?

Authors: Arnas Uselis, Andrea Dittadi, Seong Joon Oh

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We test this premise through controlled experiments that systematically vary data scale, concept diversity, and combination coverage. Our experiments reveal a clear principle: compositional generalization is driven by data diversity, not mere data scale."
Researcher Affiliation | Academia | "1 Tübingen AI Center, University of Tübingen 2 Helmholtz AI 3 Technical University of Munich 4 Munich Center for Machine Learning (MCML) 5 Max Planck Institute for Intelligent Systems, Tübingen. Correspondence to: Arnas Uselis <EMAIL>."
Pseudocode | Yes | "Algorithm 1: Recovering factored concept representations for k = 2 concepts"
Open Source Code | Yes | github.com/oshapio/visual-compositional-generalization. "We release our code and datasets publicly to promote reproducible research and responsible development of these capabilities."
Open Datasets | Yes | "We use DSPRITES (Matthey et al., 2017) (using only the heart shape to avoid symmetries), 3DSHAPES (Kim & Mnih, 2019), PUG (Bordes et al., 2023), COLOREDMNIST (Arjovsky et al., 2020), and a dataset we introduce of perceptually challenging shapes without symmetries, which we refer to as FSPRITES. Details in Appendix D. We release our code and datasets publicly to promote reproducible research and responsible development of these capabilities."
Dataset Splits | Yes | "For each concept value i, we observe combinations with values j where (i − j + n) mod n < k, and evaluate on all other combinations. This creates a clear distinction between combinations seen during training and those requiring compositional generalization. ... The training combinations (c_1^i, c_2^i) are drawn from the restricted subset S_train ⊂ C1 × C2. We refer to this as in-distribution (ID) data. (2) Testing: Evaluate on combinations from S_test = (C1 × C2) \ S_train, i.e., concept pairs that never co-occurred during training. We refer to this as out-of-distribution (OOD) data."
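The split rule quoted above can be sketched in a few lines; this is a minimal illustration assuming two concept axes with n values each, and the function name `make_split` is my own:

```python
from itertools import product

def make_split(n: int, k: int):
    """Split the n x n grid of concept pairs (i, j) into train/test sets.

    A pair is in the training set when (i - j + n) % n < k, so each
    value i co-occurs with exactly k values of j during training; all
    remaining pairs form the out-of-distribution test set.
    """
    train, test = set(), set()
    for i, j in product(range(n), range(n)):
        if (i - j + n) % n < k:
            train.add((i, j))
        else:
            test.add((i, j))
    return train, test
```

For example, with n = 4 and k = 1 the training set covers only the diagonal pairs (i, i), and the remaining 12 combinations are held out for compositional evaluation.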
Hardware Specification | No | The paper does not explicitly describe the hardware used for its experiments. It mentions models such as RESNET-50 and DINO ViT-L but gives no hardware specifications.
Software Dependencies | No | The paper mentions the "Adam (Kingma & Ba, 2017) optimizer" but does not provide version numbers for any software libraries or dependencies. It also references model architectures such as RESNET-50 and ViT without associated software versions.
Experiment Setup | Yes | "Optimization. All models are trained using the Adam (Kingma & Ba, 2017) optimizer. Based on an initial grid search, we use a learning rate of 10^-4 for ResNet training from scratch and 10^-3 for probing pre-trained features. All models are trained for 100 epochs with a batch size of 64."
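The quoted optimization setup maps onto a standard PyTorch training loop. This is a sketch, not the authors' code: only the hyperparameters (Adam, the two learning rates, 100 epochs, batch size 64) come from the paper, while the model, dataset, and loss are placeholders:

```python
import torch
from torch import nn, optim
from torch.utils.data import DataLoader, TensorDataset

# Hyperparameters reported in the paper's experiment setup.
LR_SCRATCH = 1e-4   # ResNet trained from scratch
LR_PROBE = 1e-3     # probing pre-trained features
EPOCHS = 100
BATCH_SIZE = 64

def train(model: nn.Module, dataset, lr: float, epochs: int = EPOCHS) -> nn.Module:
    """Train `model` with Adam, mirroring the quoted optimization setup."""
    loader = DataLoader(dataset, batch_size=BATCH_SIZE, shuffle=True)
    opt = optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()  # assumed; the paper does not quote a loss here
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
    return model
```

In practice the placeholder `nn.Module` would be the ResNet-50 (trained with `LR_SCRATCH`) or a linear probe over frozen pre-trained features (trained with `LR_PROBE`).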