How Compositional Generalization and Creativity Improve as Diffusion Models are Trained
Authors: Alessandro Favero, Antonio Sclocchi, Francesco Cagnetta, Pascal Frossard, Matthieu Wyart
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We investigate these questions in the context of diffusion models both theoretically and empirically. Theoretically, we consider a simple probabilistic context-free grammar, a tree-like graphical model used to represent the hierarchical and compositional structure of data such as language and images. We test these predictions across different domains and find remarkable agreement: both generated texts and images achieve progressively larger coherence lengths as the training time or dataset size grows. We show empirically that the learning process of diffusion models trained on the RHM is hierarchical, progressively capturing compositional rules at deeper levels of the PCFG's hierarchy. |
| Researcher Affiliation | Academia | ¹Institute of Physics, EPFL; ²Institute of Electrical and Micro Engineering, EPFL; ³Theoretical and Scientific Data Science, SISSA; ⁴Department of Physics and Astronomy, Johns Hopkins University (on leave from EPFL). Correspondence to: <EMAIL>, <EMAIL>. |
| Pseudocode | No | The paper describes algorithms and methods but does not contain a clearly labeled 'Pseudocode' or 'Algorithm' block, nor does it present structured steps in a code-like format. |
| Open Source Code | No | Our experiments are based on the codebase of MD4 (Shi et al., 2024): https://github.com/google-deepmind/md4. Our experiments are based on the codebase of Improved DDPMs (Nichol & Dhariwal, 2021): https://github.com/openai/improved-diffusion. The paper refers to third-party codebases used for experiments but does not explicitly state that the authors are releasing their own code for the methodology described in this paper. |
| Open Datasets | Yes | We train MD4 [...] on the OpenWebText corpus (Gokaslan & Cohen, 2019). The model is trained for 10 epochs on ImageNet 64×64 using the same hyperparameters as Nichol & Dhariwal (2021). |
| Dataset Splits | No | The model is trained for a full epoch on the training split (~10¹⁰ tokens) using the same hyperparameters as Shi et al. (2024). The model is trained for 10 epochs on ImageNet 64×64 using the same hyperparameters as Nichol & Dhariwal (2021). The paper mentions using a 'training split' and 'validation set' for standard datasets like OpenWebText and ImageNet, but does not specify the exact percentages or sample counts for these splits within the text. |
| Hardware Specification | Yes | We train MD4 with batch size 64 and context size 1024 on 4 H100s for a single epoch. |
| Software Dependencies | No | The paper mentions specific models like MD4, Improved DDPMs, and LLaMA-2-7B, but does not provide specific version numbers for general software dependencies (e.g., Python, PyTorch, TensorFlow). |
| Experiment Setup | Yes | The convolutional U-Net consists of L resolution blocks in both the encoder and decoder, with a filter size of s, stride of s, and 8192 channels. Each block uses GELU activation functions, and skip connections link encoder and decoder layers with the same resolution. The model is trained with SGD with a learning rate of 1, using a batch size of 32, and momentum parameter of 0.9. The diffusion process follows a linear schedule with 1,000 noise levels. We train MD4 with batch size 64 and context size 1024 on 4 H100s for a single epoch. In particular, we train a DDPM with 128 channels, 3 resolution blocks, 4000 diffusion steps, cosine noise schedule, learning rate 10⁻⁴, and batch size 128 for 10 epochs using a hybrid objective. |
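The DDPM setup quoted above specifies a cosine noise schedule over 4,000 diffusion steps. As a point of reference, a minimal sketch of that schedule is shown below, following the published formula of Nichol & Dhariwal (2021); the function names are illustrative rather than taken from the paper's codebase, and the clipping constant 0.999 is an assumption matching their reference implementation.

```python
import math

def cosine_alpha_bar(t: float, s: float = 0.008) -> float:
    """Cumulative signal level alpha_bar(t) for t in [0, 1] (cosine schedule)."""
    return math.cos((t + s) / (1 + s) * math.pi / 2) ** 2

def cosine_betas(num_steps: int = 4000, max_beta: float = 0.999) -> list:
    """Per-step noise variances beta_i derived from successive alpha_bar ratios."""
    betas = []
    for i in range(num_steps):
        t1, t2 = i / num_steps, (i + 1) / num_steps
        # Clip betas near t = 1, where alpha_bar approaches zero.
        betas.append(min(1.0 - cosine_alpha_bar(t2) / cosine_alpha_bar(t1), max_beta))
    return betas

betas = cosine_betas()  # 4,000 variances, increasing toward the clip value
```

The schedule adds very little noise at early steps and ramps up smoothly, which is the property motivating its use over a linear schedule in the cited work.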