On the Feature Learning in Diffusion Models

Authors: Andi Han, Wei Huang, Yuan Cao, Difan Zou

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our theoretical analysis demonstrates that diffusion models, due to the denoising objective, are encouraged to learn more balanced and comprehensive representations of the data. In contrast, neural networks with a similar architecture trained for classification tend to prioritize learning specific patterns in the data, often focusing on easy-to-learn components. To support these theoretical insights, we conduct several experiments on both synthetic and real-world datasets, which empirically validate our findings and highlight the distinct feature learning dynamics in diffusion models compared to classification.
Researcher Affiliation | Academia | RIKEN AIP (EMAIL, EMAIL); Department of Statistics and Actuarial Science, University of Hong Kong (EMAIL); Department of Computer Science and Institute of Data Science, University of Hong Kong (EMAIL). Equal contribution.
Pseudocode | No | The paper describes theoretical frameworks and experimental setups but does not include any clearly labeled pseudocode or algorithm blocks. It presents mathematical derivations and logical steps in paragraph form, particularly in the proof overview and appendix sections, rather than structured algorithm displays.
Open Source Code | No | The paper does not contain any explicit statement about releasing source code, nor does it provide a link to a code repository in the main text or appendix.
Open Datasets | Yes | We conduct both synthetic and real-world experiments to verify our theoretical claims. ... We also conduct experiments on the MNIST dataset (Lecun et al., 1998) to support our theory. In order to better control the signal-to-noise ratio, we create a Noisy-MNIST dataset, where we treat each original MNIST image as a clean signal patch and concatenate a standard Gaussian noise patch with the same size, i.e., 28 × 28.
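The Noisy-MNIST construction quoted above pairs each clean 28 × 28 image with a same-size standard Gaussian noise patch. A minimal sketch of that construction, assuming the two patches are concatenated side by side (the paper specifies the patch sizes but not the axis, so the layout here is an assumption):

```python
import numpy as np

def make_noisy_mnist(images, rng=None):
    """Concatenate each 28x28 image with a same-size standard Gaussian
    noise patch, as in the described Noisy-MNIST construction.
    The concatenation axis (width) is an assumption."""
    rng = np.random.default_rng(rng)
    images = np.asarray(images, dtype=np.float64)   # shape (n, 28, 28)
    noise = rng.standard_normal(images.shape)       # N(0, 1) patch per image
    return np.concatenate([images, noise], axis=-1) # shape (n, 28, 56)
```

Selecting 50 samples each from digits 0 and 1 (n = 100), as described, would then just subsample the MNIST arrays before calling this helper.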
Dataset Splits | Yes | Setup. We follow Definition 2.1 to generate a synthetic dataset for both diffusion model and classification. Specifically, we set data dimension d = 1000 and let µ1 = [µ, 0, ..., 0] ∈ ℝ^d and µ−1 = [0, µ, 0, ..., 0] ∈ ℝ^d. We sample the noise patch ξi ∼ N(0, Id), i ∈ [n] (i.e., σξ = 1). We set sample size and network width to be n = 30 and m = 20... The (in-distribution) test accuracy is computed with 3000 test samples. ... We also conduct experiments on the MNIST dataset (Lecun et al., 1998) to support our theory. In order to better control the signal-to-noise ratio, we create a Noisy-MNIST dataset... We select 50 samples each from digits 0 and 1 respectively (i.e., n = 100).
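The quoted setup fixes d = 1000, the two signal vectors µ1 and µ−1, and noise patches ξi ∼ N(0, I_d). A hedged sketch of that generator follows; the paper only specifies these vectors, so the exact packaging of signal and noise patches into one sample is an assumption here:

```python
import numpy as np

def sample_synthetic(n=30, d=1000, mu=5.0, rng=None):
    """Sketch of the Definition-2.1-style synthetic data: each sample
    pairs a label-dependent signal patch mu_y with a Gaussian noise
    patch xi ~ N(0, I_d). The (signal, noise) patch layout is an
    assumption; only the vectors themselves come from the paper."""
    rng = np.random.default_rng(rng)
    mu_pos = np.zeros(d); mu_pos[0] = mu    # mu_{+1} = [mu, 0, ..., 0]
    mu_neg = np.zeros(d); mu_neg[1] = mu    # mu_{-1} = [0, mu, 0, ..., 0]
    y = rng.choice([1, -1], size=n)         # balanced random labels
    signal = np.where(y[:, None] == 1, mu_pos, mu_neg)
    noise = rng.standard_normal((n, d))     # xi_i ~ N(0, I_d), sigma_xi = 1
    X = np.stack([signal, noise], axis=1)   # shape (n, 2 patches, d)
    return X, y
```

With µ = 5 this matches the low-SNR regime quoted below (n·SNR² = 30 · 25/1000 = 0.75); µ = 15 gives the high-SNR regime.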
Hardware Specification | No | The paper describes the experimental methodology and setup but does not specify any particular hardware used for running the experiments, such as GPU models, CPU types, or cloud computing instances.
Software Dependencies | No | The paper mentions the use of 'gradient descent' and 'neural networks' but does not provide specific version numbers for any software libraries, frameworks (e.g., TensorFlow, PyTorch), or programming languages used in the implementation.
Experiment Setup | Yes | We set sample size and network width to be n = 30 and m = 20 and initialize the weights to be Gaussian with a standard deviation σ0 = 0.001. We vary the choice of µ to create two problem settings: (1) low SNR with µ = 5, which leads to n·SNR² = 0.75, and (2) high SNR with µ = 15, which leads to n·SNR² = 6.75. We use the same two-layer networks introduced in Section 2. For classification, we set a learning rate of η = 0.1 and train for 500 iterations. For the diffusion model, we minimize the DDPM loss by averaging over the diffusion noise, following the standard training of diffusion models. In particular, for each sample, we sample nϵ = 2000 noises at each iteration and calculate the loss by averaging over the noise. For the noise coefficients, we consider a time t = 0.2 and set αt = exp(−t) ≈ 0.82 and βt = √(1 − exp(−2t)) ≈ 0.57. For the diffusion model, we set η = 0.5 and train for 40000 iterations.
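The quoted setup fully determines the noise coefficients and the Monte-Carlo averaging of the DDPM loss. A minimal sketch of both, assuming the standard ε-prediction DDPM objective; the `denoiser` argument is a hypothetical stand-in for the paper's two-layer network, which is not reproduced here:

```python
import numpy as np

def ddpm_noise_coefficients(t=0.2):
    """Coefficients as reported: alpha_t = exp(-t),
    beta_t = sqrt(1 - exp(-2t)); at t = 0.2 this is ~0.82 and ~0.57."""
    alpha_t = np.exp(-t)
    beta_t = np.sqrt(1.0 - np.exp(-2.0 * t))
    return alpha_t, beta_t

def ddpm_loss(denoiser, x, n_eps=2000, t=0.2, rng=None):
    """Monte-Carlo DDPM loss: average the squared noise-prediction
    error over n_eps sampled noises per iteration, as described.
    `denoiser` is a hypothetical callable (not the paper's network)."""
    rng = np.random.default_rng(rng)
    alpha_t, beta_t = ddpm_noise_coefficients(t)
    losses = []
    for _ in range(n_eps):
        eps = rng.standard_normal(x.shape)
        x_t = alpha_t * x + beta_t * eps                 # forward noising
        losses.append(np.mean((denoiser(x_t) - eps) ** 2))
    return float(np.mean(losses))
```

The ε-prediction form of the loss is a standard choice; the paper's exact parameterization of the denoiser output may differ.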