Does Data Scaling Lead to Visual Compositional Generalization?

Authors: Arnas Uselis, Andrea Dittadi, Seong Joon Oh

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We test this premise through controlled experiments that systematically vary data scale, concept diversity, and combination coverage. Our experiments reveal a clear principle: compositional generalization is driven by data diversity, not mere data scale."
Researcher Affiliation | Academia | "1 Tübingen AI Center, University of Tübingen 2 Helmholtz AI 3 Technical University of Munich 4 Munich Center for Machine Learning (MCML) 5 Max Planck Institute for Intelligent Systems, Tübingen. Correspondence to: Arnas Uselis <EMAIL>."
Pseudocode | Yes | "Algorithm 1: Recovering factored concept representations for k = 2 concepts"
Open Source Code | Yes | github.com/oshapio/visual-compositional-generalization. "We release our code and datasets publicly to promote reproducible research and responsible development of these capabilities."
Open Datasets | Yes | "We use DSPRITES (Matthey et al., 2017) (using only the heart shape to avoid symmetries), 3DSHAPES (Kim & Mnih, 2019), PUG (Bordes et al., 2023), COLOREDMNIST (Arjovsky et al., 2020), and a dataset we introduce of perceptually challenging shapes without symmetries, which we refer to as FSPRITES. Details in Appendix D. We release our code and datasets publicly to promote reproducible research and responsible development of these capabilities."
Dataset Splits | Yes | "For each concept value i, we observe combinations with values j where (i − j + n) mod n < k, and evaluate on all other combinations. This creates a clear distinction between combinations seen during training and those requiring compositional generalization. ... The training combinations (c_1^i, c_2^i) are drawn from the restricted subset S_train ⊂ C1 × C2. We refer to this as in-distribution (ID) data. (2) Testing: Evaluate on combinations from S_test = (C1 × C2) \ S_train, i.e., concept pairs that never co-occurred during training. We refer to this as out-of-distribution (OOD) data."
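The split rule quoted above can be sketched in a few lines; this is a minimal illustration assuming two concept axes with n values each, and the function name `make_split` is my own:

```python
from itertools import product

def make_split(n: int, k: int):
    """Split the n x n grid of concept pairs (i, j) into train/test sets.

    A pair is in the training set when (i - j + n) % n < k, so each
    value i co-occurs with exactly k values of j during training; all
    remaining pairs form the out-of-distribution test set.
    """
    train, test = set(), set()
    for i, j in product(range(n), range(n)):
        if (i - j + n) % n < k:
            train.add((i, j))
        else:
            test.add((i, j))
    return train, test
```

For example, with n = 4 and k = 1 the training set covers only the diagonal pairs (i, i), and the remaining 12 combinations are held out for compositional evaluation.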
Hardware Specification | No | The paper does not explicitly describe the hardware used for its experiments. It mentions models such as RESNET-50 and DINO ViT-L but gives no hardware specifications.
Software Dependencies | No | The paper mentions the "Adam (Kingma & Ba, 2017) optimizer" but does not provide version numbers for any software libraries or dependencies. It also references model architectures such as RESNET-50 and ViT without associated software versions.
Experiment Setup | Yes | "Optimization. All models are trained using the Adam (Kingma & Ba, 2017) optimizer. Based on an initial grid search, we use a learning rate of 10^-4 for ResNet training from scratch and 10^-3 for probing pre-trained features. All models are trained for 100 epochs with a batch size of 64."
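The quoted optimization setup maps onto a standard PyTorch training loop. This is a sketch, not the authors' code: only the hyperparameters (Adam, the two learning rates, 100 epochs, batch size 64) come from the paper, while the model, dataset, and loss are placeholders:

```python
import torch
from torch import nn, optim
from torch.utils.data import DataLoader, TensorDataset

# Hyperparameters reported in the paper's experiment setup.
LR_SCRATCH = 1e-4   # ResNet trained from scratch
LR_PROBE = 1e-3     # probing pre-trained features
EPOCHS = 100
BATCH_SIZE = 64

def train(model: nn.Module, dataset, lr: float, epochs: int = EPOCHS) -> nn.Module:
    """Train `model` with Adam, mirroring the quoted optimization setup."""
    loader = DataLoader(dataset, batch_size=BATCH_SIZE, shuffle=True)
    opt = optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()  # assumed; the paper does not quote a loss here
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
    return model
```

In practice the placeholder `nn.Module` would be the ResNet-50 (trained with `LR_SCRATCH`) or a linear probe over frozen pre-trained features (trained with `LR_PROBE`).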