Generalization in diffusion models arises from geometry-adaptive harmonic representations
Authors: Zahra Kadkhodaie, Florentin Guth, Eero P Simoncelli, Stéphane Mallat
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We trained denoisers on subsets of the (downsampled) CelebA dataset (Liu et al., 2015) of size N = 10^0, 10^1, 10^2, 10^3, 10^4, 10^5. We used a UNet architecture (Ronneberger et al., 2015)... Results are shown in Figure 1. When N = 1, the denoiser essentially memorizes the single training image, leading to a high test error. Increasing N substantially increases the performance on the test set while worsening performance on the training set, as the network transitions from memorization to generalization. At N = 10^5, empirical test and train error are matched for all noise levels. |
| Researcher Affiliation | Academia | Zahra Kadkhodaie, Ctr. for Data Science, New York University, EMAIL; Florentin Guth, Ctr. for Data Science, New York University, and Flatiron Institute, Simons Foundation, EMAIL; Eero P. Simoncelli, New York University, and Flatiron Institute, Simons Foundation, EMAIL; Stéphane Mallat, Collège de France, and Flatiron Institute, Simons Foundation, EMAIL |
| Pseudocode | Yes | Algorithm 1 Sampling via ascent of the log-likelihood gradient from a denoiser residual |
| Open Source Code | Yes | Source code: https://github.com/LabForComputationalVision/memorization_generalization_in_diffusion_models |
| Open Datasets | Yes | We trained denoisers on subsets of the (downsampled) CelebA dataset (Liu et al., 2015) of size N = 10^0, 10^1, 10^2, 10^3, 10^4, 10^5. ... For experiments shown in Figures 9 and 10, we use images drawn from the LSUN bedroom dataset (Yu et al., 2015) downsampled to 80 × 80 resolution. ... For experiments shown in Figure 11 we use the CelebA-HQ dataset (Karras et al., 2018) downsampled to 40 × 40 resolution. |
| Dataset Splits | No | The paper discusses training and test data performance (e.g., "At N = 10^5, empirical test and train error are matched"), but it does not explicitly mention a separate 'validation' dataset or its split details. |
| Hardware Specification | No | The paper acknowledges computing resources from the Flatiron Institute and NYU but does not specify any particular hardware components like GPU models, CPU types, or memory used for experiments. |
| Software Dependencies | No | The paper mentions using specific network architectures like UNet and BF-CNN, but it does not list any specific software libraries, frameworks, or operating systems with version numbers (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | Training is carried out on batches of size 512, for 1000 epochs. ... We chose h = 0.01, β = 0.1, σ0 = 1, and σ = 0.05. |
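The pseudocode row refers to the paper's Algorithm 1 (sampling via ascent of the log-likelihood gradient from a denoiser residual), and the setup row reports the constants h = 0.01, β = 0.1, σ0 = 1, σ = 0.05. A minimal sketch of that style of sampler is below, assuming the standard formulation in which the denoiser residual f(y) − y is proportional to the score σ²∇ log p(y). The `toy_denoiser` is a hypothetical stand-in, not the paper's trained UNet, and the function name and step-size/noise-injection details are illustrative rather than a verbatim reproduction of Algorithm 1.

```python
import numpy as np

def sample_from_denoiser(denoiser, shape, h=0.01, beta=0.1,
                         sigma_0=1.0, sigma_end=0.05, seed=0):
    """Coarse-to-fine sampling by ascending the log-likelihood gradient,
    approximated (up to sigma^2) by the denoiser residual f(y) - y."""
    rng = np.random.default_rng(seed)
    y = rng.normal(0.0, sigma_0, size=shape)       # initialize with pure noise
    sigma_t = sigma_0
    while sigma_t > sigma_end:
        d = denoiser(y) - y                        # residual ~ sigma^2 * grad log p(y)
        sigma_t = float(np.sqrt(np.mean(d ** 2)))  # effective remaining-noise estimate
        # injected-noise amplitude; beta in [0, 1] controls stochasticity
        gamma = np.sqrt(max((1 - beta * h) ** 2 - (1 - h) ** 2, 0.0)) * sigma_t
        y = y + h * d + gamma * rng.normal(size=shape)
    return y

# Hypothetical toy denoiser: shrinks its input toward zero, mimicking a
# denoiser for a prior concentrated near the origin.
toy_denoiser = lambda y: 0.5 * y
x = sample_from_denoiser(toy_denoiser, shape=(8, 8))
```

With the reported constants, each step takes a small gradient-ascent move of size h along the residual and re-injects a little noise; the loop terminates once the estimated noise level σ_t falls below the target σ = 0.05.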