When does compositional structure yield compositional generalization? A kernel theory.
Authors: Samuel Lippl, Kimberly Stachenfeld
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present a theory of compositional generalization in kernel models with fixed, compositionally structured representations. This provides a tractable framework for characterizing the impact of training data statistics on generalization. We find that these models are limited to functions that assign values to each combination of components seen during training, and then sum up these values (conjunction-wise additivity). This imposes fundamental restrictions on the set of tasks compositionally structured kernel models can learn, in particular preventing them from transitively generalizing equivalence relations. Even for compositional tasks that they can learn in principle, we identify novel failure modes in compositional generalization (memorization leak and shortcut bias) that arise from biases in the training data. Finally, we empirically validate our theory, showing that it captures the behavior of deep neural networks (convolutional networks, residual networks, and Vision Transformers) trained on a set of compositional tasks with similarly structured data. |
| Researcher Affiliation | Collaboration | Samuel Lippl, Center for Theoretical Neuroscience, Columbia University, New York, NY, USA, EMAIL; Kimberly Stachenfeld, Google DeepMind and Center for Theoretical Neuroscience, Columbia University, New York, NY, USA, EMAIL |
| Pseudocode | No | The paper describes mathematical derivations and theoretical findings but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code required to reproduce all experiments can be found under https://github.com/sflippl/compositional-generalization. |
| Open Datasets | Yes | Deep networks trained on MNIST and CIFAR versions of compositional tasks. |
| Dataset Splits | Yes | After training models on certain combinations of components Z_train ⊆ Z = ∏_{c=1}^{C} Z_c, we assess generalization on all other combinations Z_test := Z \ Z_train. |
| Hardware Specification | No | The paper discusses software frameworks like PyTorch and PyTorch Lightning and types of neural networks (ConvNets, ResNets, ViTs) but does not provide any specific details about the hardware (e.g., GPU or CPU models) used for training or running experiments. |
| Software Dependencies | No | All networks were trained with PyTorch and PyTorch Lightning (Paszke et al., 2019). We fit the kernel models by hand-specifying the kernel and fitting either a support vector regression or classification using scikit-learn (Pedregosa et al., 2011). |
| Experiment Setup | Yes | We consider ReLU networks with one hidden layer and H = 1000 units. We initialize by σ√(2/H), considering σ ∈ [10⁻⁶, 1]... We considered networks with four convolutional layers (kernel size five; two layers have 32 filters, two have 64 filters) and two densely connected layers (with 512 and 1024 units)... We trained these networks with SGD using a learning rate of 10⁻⁴ and momentum of 0.9... We trained a residual neural network with eight blocks in total... using the Adam optimizer with a learning rate of 10⁻³ for 100 epochs... Finally, we trained a Vision Transformer (ViT) with six attention heads, 256 dimensions for both the attention layer and the MLP, and a depth of four, using Adam with a learning rate of 10⁻⁴ for 200 epochs. |
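
The simplest model in the experiment setup, a one-hidden-layer ReLU network with H = 1000 units and weights drawn with standard deviation σ√(2/H), can be sketched as follows. This is a hypothetical stdlib-only illustration, not the authors' code (which is in the linked repository); the input dimension, the σ value, and the application of the same scale to both layers are assumptions for the example.

```python
# Hypothetical sketch of the paper's one-hidden-layer ReLU network:
# H = 1000 hidden units, Gaussian weights with std sigma * sqrt(2 / H).
# Not the authors' implementation; input size and sigma are illustrative.
import math
import random

H = 1000          # hidden units, as stated in the setup
SIGMA = 1e-3      # one value from the reported range [1e-6, 1]

def init_weights(n_in, n_out, sigma, rng):
    """n_out x n_in Gaussian weight matrix with std sigma * sqrt(2 / H)."""
    scale = sigma * math.sqrt(2.0 / H)
    return [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]

def relu(v):
    return [max(0.0, x) for x in v]

def matvec(W, x):
    return [sum(w * xj for w, xj in zip(row, x)) for row in W]

def forward(x, W1, W2):
    """Scalar output of the two-layer ReLU network."""
    return matvec(W2, relu(matvec(W1, x)))[0]

rng = random.Random(0)
n_in = 10                                   # assumed input dimension
W1 = init_weights(n_in, H, SIGMA, rng)      # input -> hidden
W2 = init_weights(H, 1, SIGMA, rng)         # hidden -> scalar output
y = forward([1.0] * n_in, W1, W2)
```

Small σ pushes the network toward the lazy (kernel) regime the theory analyzes, which is presumably why the setup sweeps σ over several orders of magnitude.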