An analytic theory of creativity in convolutional diffusion models
Authors: Mason Kamb, Surya Ganguli
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We next test our theory on two CNN-based architectures, a standard UNet (Ronneberger et al., 2015) and a ResNet (He et al., 2016) trained on 4 datasets, MNIST, Fashion MNIST, CIFAR10, and CelebA (see App. C.1 for details of architectures and training). We restrict our attention to these simple datasets because our theory is for CNN-based diffusion models only, and more complex diffusion models with attention and latent spaces are required to model more complex datasets. [...] For ResNets, we find median r² values between theory and experiment of 0.94 on MNIST, 0.90 on Fashion MNIST, 0.90 on CIFAR10, and 0.96 on CelebA 32x32. |
| Researcher Affiliation | Academia | Department of Applied Physics, Stanford University, California, United States. Correspondence to: Mason Kamb <EMAIL>, Surya Ganguli <EMAIL>. |
| Pseudocode | No | The paper includes mathematical derivations and descriptions of methods, but it does not present any explicitly labeled pseudocode or algorithm blocks with structured steps formatted like code. |
| Open Source Code | Yes | Code for the following experiments hosted at https://github.com/Kambm/convolutional-diffusion |
| Open Datasets | Yes | trained on 4 datasets, MNIST, Fashion MNIST, CIFAR10, and CelebA (see App. C.1 for details of architectures and training). |
| Dataset Splits | No | The paper mentions training on the MNIST, Fashion MNIST, CIFAR10, and CelebA datasets, which typically come with standard splits. However, it does not provide split percentages or sample counts for train/test/validation, nor does it state that the standard splits were used for its own experiments. It mentions evaluating on '100 distinct random noise inputs'. |
| Hardware Specification | No | The paper describes the CNN-based architectures (UNet, ResNet) and general training parameters, but it does not specify any particular hardware used for running the experiments, such as GPU models, CPU types, or cloud computing environments with specifications. |
| Software Dependencies | No | The paper mentions the Adam optimizer and a cosine noise schedule, and the GitHub link suggests a deep-learning framework such as PyTorch, but it does not explicitly list any software dependencies with specific version numbers (e.g., Python, PyTorch, or CUDA versions). |
| Experiment Setup | Yes | For all experiments, we train each model for 300 epochs with Adam, using an initial learning rate of 1e-4, a batch size of 128, and an exponential learning rate schedule that applies a multiplicative factor of 0.999965 to the learning rate with each step (this approximately halves the learning rate over the course of 50 epochs with our batch size of 128). |
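The quoted schedule can be sanity-checked with a short calculation: at batch size 128 on a 60,000-image training set such as MNIST (roughly 469 steps per epoch, an assumption not stated in the quote), applying the 0.999965 factor at every step brings the learning rate to a little under half its initial value after 50 epochs. A minimal Python sketch of that arithmetic:

```python
import math

# Exponential LR schedule quoted in the setup: multiply by gamma each step.
initial_lr = 1e-4
gamma = 0.999965

# Assumption: ~60,000 training images (MNIST) at batch size 128.
steps_per_epoch = math.ceil(60_000 / 128)   # ~469 steps
steps = 50 * steps_per_epoch

decayed_lr = initial_lr * gamma ** steps
print(f"LR after 50 epochs: {decayed_lr:.2e}")  # ~4.4e-05, roughly half of 1e-4
```

The decayed value lands near 0.44x the initial rate, consistent with the paper's "approximately halves" description.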
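The median r² figures quoted in the Research Type cell aggregate a per-input coefficient of determination over the '100 distinct random noise inputs' mentioned under Dataset Splits. A minimal sketch of that aggregation, using plain Python with synthetic placeholder data (the data and function names here are illustrative, not taken from the paper's code):

```python
import random
import statistics

def r_squared(y_true, y_pred):
    """Coefficient of determination between two flattened output vectors."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

random.seed(0)
scores = []
for _ in range(100):  # 100 distinct random noise inputs
    # Placeholder stand-ins for theoretical and experimental model outputs.
    theory = [random.gauss(0, 1) for _ in range(256)]
    experiment = [t + random.gauss(0, 0.1) for t in theory]
    scores.append(r_squared(theory, experiment))

median_r2 = statistics.median(scores)
print(f"median r^2 over 100 inputs: {median_r2:.3f}")
```

Taking the median rather than the mean makes the summary robust to the occasional noise seed where theory and experiment diverge.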