Can Diffusion Models Learn Hidden Inter-Feature Rules Behind Images?
Authors: Yujin Han, Andi Han, Wei Huang, Chaochao Lu, Difan Zou
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical evaluations on mainstream DMs (e.g., Stable Diffusion 3.5) reveal consistent failures, such as inconsistent lighting-shadow relationships and mismatched object-mirror reflections. Inspired by these findings, we design four synthetic tasks with strongly correlated features to assess DMs' rule-learning abilities. Extensive experiments show that while DMs can identify coarse-grained rules, they struggle with fine-grained ones. Our theoretical analysis demonstrates that DMs trained via denoising score matching (DSM) exhibit constant errors in learning hidden rules, as the DSM objective is not compatible with rule conformity. |
| Researcher Affiliation | Academia | 1The University of Hong Kong 2The University of Sydney 3RIKEN AIP 4Shanghai AI Laboratory. Correspondence to: Yujin Han <EMAIL>, Andi Han <EMAIL>, Difan Zou <EMAIL>. |
| Pseudocode | No | The paper describes a "Pipeline for extracting features" in Figure 2, but it is a diagrammatic representation and not structured pseudocode or an algorithm block. There are no other sections explicitly labeled "Pseudocode" or "Algorithm". |
| Open Source Code | No | The paper references existing open-source models and platforms like Hugging Face and Fal.ai (e.g., Stable Diffusion 3.5, Flux.1 Dev, SDXL, Infinity) that were used or discussed. However, it does not explicitly state that the authors' own implementation code for the methodology described in this paper is released, nor does it provide a direct link to such a repository or state that it is available in supplementary material. |
| Open Datasets | Yes | Specifically, we consider the Syn Mirror (Dhiman et al., 2024) dataset, which presents objects and their reflections, where rules connect features such as color, size, and shape. ... We also construct the Cifar-MNIST dataset, which pairs specific CIFAR and MNIST classes (e.g., Cats/Dogs with 0/1). |
| Dataset Splits | Yes | 4000, 2000, 2000, and 2000 samples are generated for synthetic tasks A, B, C, and D, respectively, with an image size of 32×32. ... Based on the contrastive data constructed in Figure 19, we split the training and test data in an 80:20 ratio and directly train a three-way classifier in the raw image space... |
| Hardware Specification | Yes | The training is performed on a single NVIDIA A800 GPU for 400, 800, 1600, and 1000 epochs, respectively. |
| Software Dependencies | No | The paper mentions using "AdamW (Loshchilov, 2017) as the optimizer" and the "U-Net architecture (Ronneberger et al., 2015) as the denoiser." However, it does not provide specific version numbers for software components like Python, PyTorch, or other libraries, which are necessary for reproducible software dependencies. |
| Experiment Setup | Yes | Following the training setting (Aithal et al., 2024), we fix the total timesteps at T = 1000 and employ the widely-used U-Net architecture (Ronneberger et al., 2015) as the denoiser. ... We use AdamW (Loshchilov, 2017) as the optimizer with a learning rate of 3e-4. The noisy steps are set to T = 1000, with a linear noise schedule ranging from 1e-4 to 2e-2. For Tasks A, B, C, and D, the sample sizes are 4000, 2000, 2000, and 2000, respectively, and the input data size is (3, 32, 32). The training is performed on a single NVIDIA A800 GPU for 400, 800, 1600, and 1000 epochs, respectively. |
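The experiment-setup row fully pins down the diffusion schedule: T = 1000 timesteps with a linear noise schedule from 1e-4 to 2e-2 on inputs of shape (3, 32, 32). As a sanity check on that configuration, a minimal sketch of the standard DDPM forward (noising) process under those values is below; the NumPy implementation and the `add_noise` helper are illustrative assumptions, not the authors' released code.

```python
import numpy as np

# Linear noise schedule exactly as reported: T = 1000 steps, betas from 1e-4 to 2e-2.
T = 1000
betas = np.linspace(1e-4, 2e-2, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)  # cumulative product \bar{alpha}_t, decreasing in t

def add_noise(x0, t, rng):
    """Forward diffusion q(x_t | x_0): scale the clean image, add Gaussian noise."""
    noise = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * noise
    return xt, noise

# Example: noise a dummy (3, 32, 32) image at a mid-trajectory timestep.
rng = np.random.default_rng(0)
x0 = rng.standard_normal((3, 32, 32))
xt, eps = add_noise(x0, t=500, rng=rng)
```

A DSM-trained denoiser (the U-Net in the paper) would be trained to predict `eps` from `(xt, t)`; the sketch covers only the data-corruption side that the reported hyperparameters determine.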