Mechanisms of Projective Composition of Diffusion Models
Authors: Arwen Bradley, Preetum Nakkiran, David Berthelot, James Thornton, Joshua M. Susskind
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We study the theoretical foundations of composition in diffusion models, with a particular focus on out-of-distribution extrapolation and length-generalization. Prior work has shown that composing distributions via linear score combination can achieve promising results, including length-generalization in some cases (Du et al., 2023; Liu et al., 2022). However, our theoretical understanding of how and why such compositions work remains incomplete. In fact, it is not even entirely clear what it means for composition to "work." This paper starts to address these fundamental gaps. We begin by precisely defining one possible desired result of composition, which we call projective composition. Then, we investigate: (1) when linear score combinations provably achieve projective composition, (2) whether reverse-diffusion sampling can generate the desired composition, and (3) the conditions under which composition fails. We connect our theoretical analysis to prior empirical observations where composition has either worked or failed, for reasons that were unclear at the time. Finally, we propose a simple heuristic to help predict the success or failure of new compositions. |
| Researcher Affiliation | Industry | 1Apple, Cupertino, CA, USA. Correspondence to: Arwen Bradley <EMAIL>, Preetum Nakkiran <EMAIL>. |
| Pseudocode | No | The paper describes mathematical definitions and theoretical results. It does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide concrete access to source code for the methodology described. It mentions using SDXL (with details in Appendix C), but this refers to the use of existing models rather than a release of the authors' own implementation. |
| Open Datasets | Yes | We used the CLEVR (Johnson et al., 2017) dataset generation procedure (https://github.com/facebookresearch/clevr-dataset-gen) to generate datasets customized to the needs of the present work. All default objects, shapes, sizes, colors were kept unchanged. Images were generated in their original resolution of 320 x 240 and down-sampled to a lower resolution of 128 x 128 to facilitate experimentation and to be more GPU resources friendly. |
| Dataset Splits | No | The paper specifies the number of samples in the generated datasets (e.g., "A background dataset (0 objects) with 50,000 samples"), and states "In all experiments, the model is trained with a batch size of 2048 over 128 x 2^20 samples by looping over the dataset as often as needed to reach that number." However, it does not explicitly describe training/test/validation dataset splits with percentages, counts, or a clear methodology for partitioning the data for evaluation. |
| Hardware Specification | Yes | In practice, training takes around 16 hours to complete on 32 A100 GPUs. |
| Software Dependencies | No | We used our own PyTorch re-implementation of the EDM2 (Karras et al., 2024) U-net architecture. Our re-implementation is functionally equivalent, and only differs in optimizations introduced to save memory and GPU cycles. We used the smallest model architecture, e.g. edm2-img64-xs from https://github.com/NVlabs/edm2. |
| Experiment Setup | Yes | In all experiments, the model is trained with a batch size of 2048 over 128 x 2^20 samples by looping over the dataset as often as needed to reach that number. In practice, training takes around 16 hours to complete on 32 A100 GPUs. We used almost the same training procedure as in EDM2 (Karras et al., 2024), which is essentially a standard training loop with gradient accumulation. The only difference is that we do weight renormalization after the weights are updated rather than before, as the authors originally did. For simplicity, we did not use post-hoc EMA to obtain the final weights used in inference. Instead we took the average of weights over the last 4096 training updates. The denoising procedure for inference is exactly the same as in EDM2 (Karras et al., 2024), i.e., 65 model calls using a 32-step Heun sampler. |
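The quoted setup replaces post-hoc EMA with a plain average of the weights from the last 4096 updates. A minimal sketch of that averaging step (an illustrative assumption, not the authors' released code; `average_last_k` and the toy parameter lists are hypothetical names for this example):

```python
# Sketch (assumed, not the paper's implementation): average model weights
# over the most recent k training updates, as a simple alternative to
# post-hoc EMA. Weights are represented here as flat lists of floats.
from collections import deque

def average_last_k(weight_history, k=4096):
    """Elementwise mean of the most recent k weight snapshots."""
    recent = list(weight_history)[-k:]
    n = len(recent)
    dim = len(recent[0])
    return [sum(w[i] for w in recent) / n for i in range(dim)]

# Toy usage: a 3-parameter "model" over 10 updates, averaged over the last 4.
history = deque(maxlen=4096)  # bounded buffer of recent snapshots
for step in range(10):
    history.append([float(step), float(step) * 2.0, 1.0])
avg = average_last_k(history, k=4)  # mean of snapshots from steps 6..9
```

In practice one would keep a running mean (or a rolling buffer of state dicts) rather than storing 4096 full checkpoints; the deque here just keeps the sketch self-contained.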