Mechanisms of Projective Composition of Diffusion Models
Authors: Arwen Bradley, Preetum Nakkiran, David Berthelot, James Thornton, Joshua M. Susskind
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We study the theoretical foundations of composition in diffusion models, with a particular focus on out-of-distribution extrapolation and length-generalization. Prior work has shown that composing distributions via linear score combination can achieve promising results, including length-generalization in some cases (Du et al., 2023; Liu et al., 2022). However, our theoretical understanding of how and why such compositions work remains incomplete. In fact, it is not even entirely clear what it means for composition to "work." This paper starts to address these fundamental gaps. We begin by precisely defining one possible desired result of composition, which we call projective composition. Then, we investigate: (1) when linear score combinations provably achieve projective composition, (2) whether reverse-diffusion sampling can generate the desired composition, and (3) the conditions under which composition fails. We connect our theoretical analysis to prior empirical observations where composition has either worked or failed, for reasons that were unclear at the time. Finally, we propose a simple heuristic to help predict the success or failure of new compositions. |
| Researcher Affiliation | Industry | 1Apple, Cupertino, CA, USA. Correspondence to: Arwen Bradley <EMAIL>, Preetum Nakkiran <EMAIL>. |
| Pseudocode | No | The paper describes mathematical definitions and theoretical results. It does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide concrete access to source code for the methodology described. It mentions using SDXL (with details in Appendix C), but this refers to the use of existing models rather than a release of the authors' own implementation. |
| Open Datasets | Yes | We used the CLEVR (Johnson et al., 2017) dataset generation procedure (https://github.com/facebookresearch/clevr-dataset-gen) to generate datasets customized to the needs of the present work. All default objects, shapes, sizes, colors were kept unchanged. Images were generated in their original resolution of 320 x 240 and down-sampled to a lower resolution of 128 x 128 to facilitate experimentation and to be more GPU resources friendly. |
| Dataset Splits | No | The paper specifies the number of samples in the generated datasets (e.g., "A background dataset (0 objects) with 50,000 samples"), and states "In all experiments, the model is trained with a batch size of 2048 over 128 x 2^20 samples by looping over the dataset as often as needed to reach that number." However, it does not explicitly describe training/test/validation dataset splits with percentages, counts, or a clear methodology for partitioning the data for evaluation. |
| Hardware Specification | Yes | In practice, training takes around 16 hours to complete on 32 A100 GPUs. |
| Software Dependencies | No | We used our own PyTorch re-implementation of the EDM2 (Karras et al., 2024) U-net architecture. Our re-implementation is functionally equivalent, and only differs in optimizations introduced to save memory and GPU cycles. We used the smallest model architecture, e.g. edm2-img64-xs from https://github.com/NVlabs/edm2. |
| Experiment Setup | Yes | In all experiments, the model is trained with a batch size of 2048 over 128 x 2^20 samples by looping over the dataset as often as needed to reach that number. In practice, training takes around 16 hours to complete on 32 A100 GPUs. We used almost the same training procedure as in EDM2 (Karras et al., 2024), which is essentially a standard training loop with gradient accumulation. The only difference is that we do weight renormalization after the weights are updated rather than before, as the authors originally did. For simplicity, we did not use post-hoc EMA to obtain the final weights used in inference. Instead we took the average of weights over the last 4096 training updates. The denoising procedure for inference is exactly the same as in EDM2 (Karras et al., 2024), i.e., 65 model calls using a 32-step Heun sampler. |
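The quoted setup replaces post-hoc EMA with a plain average of the weights from the last 4096 updates. A minimal sketch of that averaging step (an illustrative assumption, not the authors' released code; `average_last_k` and the toy parameter lists are hypothetical names for this example):

```python
# Sketch (assumed, not the paper's implementation): average model weights
# over the most recent k training updates, as a simple alternative to
# post-hoc EMA. Weights are represented here as flat lists of floats.
from collections import deque

def average_last_k(weight_history, k=4096):
    """Elementwise mean of the most recent k weight snapshots."""
    recent = list(weight_history)[-k:]
    n = len(recent)
    dim = len(recent[0])
    return [sum(w[i] for w in recent) / n for i in range(dim)]

# Toy usage: a 3-parameter "model" over 10 updates, averaged over the last 4.
history = deque(maxlen=4096)  # bounded buffer of recent snapshots
for step in range(10):
    history.append([float(step), float(step) * 2.0, 1.0])
avg = average_last_k(history, k=4)  # mean of snapshots from steps 6..9
```

In practice one would keep a running mean (or a rolling buffer of state dicts) rather than storing 4096 full checkpoints; the deque here just keeps the sketch self-contained.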