Cascaded Diffusion Models for High Fidelity Image Generation

Authors: Jonathan Ho, Chitwan Saharia, William Chan, David J. Fleet, Mohammad Norouzi, Tim Salimans

JMLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | 4. Experiments. We designed experiments to improve the sample quality metrics of cascaded diffusion models on class-conditional ImageNet generation. Our cascading pipelines consist of class-conditional diffusion models at all resolutions, so class information is injected at all resolutions: see Fig. 4. Our final ImageNet results are described in Section 4.1. To give insight into our cascading pipelines, we begin with improvements on a baseline non-cascaded model at the 64×64 resolution (Section 4.2), then we show that cascading up to 64×64 improves upon our best non-cascaded 64×64 model, but only in conjunction with conditioning augmentation. We also show that truncated and non-truncated conditioning augmentation perform equally well (Section 4.3), and we study random Gaussian blur augmentation to train super-resolution models to resolutions of 128×128 and 256×256 (Section 4.4). Finally, we verify that conditioning augmentation is also effective on the LSUN dataset (Yu et al., 2015) and therefore is not specific to ImageNet (Section 4.5).
Researcher Affiliation | Industry | Jonathan Ho EMAIL, Chitwan Saharia EMAIL, William Chan EMAIL, David J. Fleet EMAIL, Mohammad Norouzi EMAIL, Tim Salimans EMAIL. Google, 1600 Amphitheatre Parkway, Mountain View, CA 94043
Pseudocode | Yes |
Algorithm 1 Training a two-stage CDM with Gaussian conditioning augmentation
 1: repeat  ▷ Train base model
 2:   (z0, c) ∼ p(z, c)  ▷ Sample low-resolution image and label
 3:   t ∼ U({1, . . . , T})
 4:   ϵ ∼ N(0, I)
 5:   zt = √ᾱt z0 + √(1 − ᾱt) ϵ
 6:   θ ← θ − η ∇θ ‖ϵθ(zt, t, c) − ϵ‖²  ▷ Simple loss (can be replaced with a hybrid loss)
 7: until converged
 8: repeat  ▷ Train super-resolution model (in parallel with the base model)
 9:   (x0, z0, c) ∼ p(x, z, c)  ▷ Sample low- and high-resolution images and label
10:   s, t ∼ U({1, . . . , T})
11:   ϵz, ϵx ∼ N(0, I)  ▷ Note: ϵz, ϵx should have the same shapes as z0, x0, respectively
12:   zs = √ᾱs z0 + √(1 − ᾱs) ϵz  ▷ Apply Gaussian conditioning augmentation
13:   xt = √ᾱt x0 + √(1 − ᾱt) ϵx
14:   θ ← θ − η ∇θ ‖ϵθ(xt, t, zs, s, c) − ϵx‖²
15: until converged

Algorithm 2 Sampling from a two-stage CDM with Gaussian conditioning augmentation
Require: c: class label
Require: s: conditioning augmentation truncation time
 1: zT ∼ N(0, I)
 2: if using truncated conditioning augmentation then
 3:   for t = T, . . . , s + 1 do
 4:     zt−1 ∼ pθ(zt−1 | zt, c)
 5:   end for
 6: else
 7:   for t = T, . . . , 1 do
 8:     zt−1 ∼ pθ(zt−1 | zt, c)
 9:   end for
10:   zs ∼ q(zs | z0)  ▷ Overwrite previously sampled value of zs
11: end if
12: xT ∼ N(0, I)
13: for t = T, . . . , 1 do
14:   xt−1 ∼ pθ(xt−1 | xt, zs, c)
15: end for
16: return x0
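To make the data-preparation half of Algorithm 1 concrete, here is a minimal NumPy sketch of Gaussian conditioning augmentation for the super-resolution stage: the high-resolution target is noised to a random level t while the low-resolution conditioning image is independently noised to a random level s. The function names (`diffuse`, `sr_training_inputs`) and the linear ᾱ schedule are illustrative assumptions, not the paper's implementation; the neural network and gradient step are omitted.

```python
import numpy as np

def diffuse(x0, alpha_bar_t, eps):
    """Forward-process sample: x_t = sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * eps."""
    return np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps

def sr_training_inputs(x0, z0, alpha_bar, rng):
    """Build one training example for the super-resolution stage
    (Algorithm 1, steps 9-13): noise the high-res image x0 to level t
    and the low-res conditioning image z0 to an independent level s."""
    T = len(alpha_bar)
    s, t = rng.integers(0, T, size=2)       # s, t ~ U({1, ..., T}) (0-indexed here)
    eps_z = rng.standard_normal(z0.shape)   # noise shaped like z0
    eps_x = rng.standard_normal(x0.shape)   # noise shaped like x0
    z_s = diffuse(z0, alpha_bar[s], eps_z)  # augmented conditioning signal
    x_t = diffuse(x0, alpha_bar[t], eps_x)  # network input
    return x_t, t, z_s, s, eps_x            # eps_x is the regression target

# Toy usage with a simple linear alpha-bar schedule (illustrative only).
rng = np.random.default_rng(0)
alpha_bar = np.linspace(0.9999, 0.0001, 1000)
x0 = rng.standard_normal((64, 64, 3))   # high-res image stand-in
z0 = rng.standard_normal((32, 32, 3))   # low-res image stand-in
x_t, t, z_s, s, target = sr_training_inputs(x0, z0, alpha_bar, rng)
```

The model ϵθ(xt, t, zs, s, c) would then be trained to predict `target` from `(x_t, t, z_s, s)` plus the class label, exactly as in step 14 of Algorithm 1.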
Open Source Code | No | High resolution figures and additional supplementary material can be found at https://cascaded-diffusion.github.io/. (The paper mentions a website for supplementary material but does not explicitly state that source code is provided there.)
Open Datasets | Yes | Our key contribution is the use of cascades to improve the sample quality of diffusion models on class-conditional ImageNet. ... We cropped and resized the ImageNet dataset (Russakovsky et al., 2015) in the same manner as BigGAN (Brock et al., 2019). ... Finally, we verify that conditioning augmentation is also effective on the LSUN dataset (Yu et al., 2015) and therefore is not specific to ImageNet (Section 4.5).
Dataset Splits | Yes | We cropped and resized the ImageNet dataset (Russakovsky et al., 2015) in the same manner as BigGAN (Brock et al., 2019). We report Inception scores using the standard practice of generating 50k samples and calculating the mean and standard deviation over 10 splits (Salimans et al., 2016). Generally, throughout our experiments, we selected models and performed early stopping based on FID score calculated over 10k samples, but all reported FID scores are calculated over 50k samples for comparison with other work (Heusel et al., 2017). The FID scores we used for model selection and reporting model performance are calculated against training set statistics according to common practice, but since this can be seen as overfitting on the performance metric, we additionally report model performance using FID scores calculated against validation set statistics. ... (The relatively large FID scores between generated examples and the validation sets are explained by the fact that the LSUN Church and Bedroom validation sets are extremely small, consisting of only 300 examples each.)
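The 50k-sample, 10-split reporting convention above can be sketched as follows. The per-sample values here are synthetic stand-ins (the real Inception Score is computed per split from classifier logits), so only the splitting and mean/std aggregation are illustrated; the function name `score_over_splits` is an assumption.

```python
import numpy as np

def score_over_splits(per_sample_values, n_splits=10):
    """Report a metric as (mean, std) over equal splits of the samples,
    following the convention of Salimans et al. (2016): compute the
    metric per split, then aggregate across splits."""
    splits = np.array_split(np.asarray(per_sample_values), n_splits)
    split_scores = np.array([s.mean() for s in splits])  # stand-in per-split metric
    return split_scores.mean(), split_scores.std()

# 50k synthetic per-sample values standing in for real metric inputs.
values = np.random.default_rng(0).normal(loc=200.0, scale=5.0, size=50_000)
mean, std = score_over_splits(values)
```

Reporting both the mean and the across-split standard deviation gives a rough error bar on the sample-quality estimate.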
Hardware Specification | Yes | Hardware: 256 TPU-v3 cores
Software Dependencies | No | The paper describes various model architectures, optimizers (e.g., Adam), and loss functions, but it does not specify concrete version numbers for software dependencies such as programming languages or libraries used for implementation.
Experiment Setup | Yes | Appendix B. Hyperparameters. Here we give the hyperparameters of the models in our ImageNet cascading pipelines. Each model in the pipeline is described by its diffusion process, its neural network architecture, and its training hyperparameters. Architecture hyperparameters, such as the base channel count and the list of channel multipliers per resolution, refer to hyperparameters of the U-Net in DDPM and related models (Ho et al., 2020; Nichol and Dhariwal, 2021; Saharia et al., 2021; Salimans et al., 2017). The cosine noise schedule and the hybrid loss method of learning reverse process variances are from Improved DDPM (Nichol and Dhariwal, 2021). Some models are conditioned on αt for post-training sampler tuning (Chen et al., 2021; Saharia et al., 2021).
32×32 base model
Architecture: Base channels: 256; Channel multipliers: 1, 2, 3, 4; Residual blocks per resolution: 6; Attention resolutions: 8, 16; Attention heads: 4
Training: Optimizer: Adam; Batch size: 2048; Learning rate: 1e-4; Steps: 700000; Dropout: 0.1; EMA: 0.9999; Hardware: 256 TPU-v3 cores
Diffusion: Timesteps: 4000; Noise schedule: cosine; Learned variances: yes; Loss: hybrid
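The cosine noise schedule cited above has a simple closed form in Nichol and Dhariwal (2021): ᾱ(t) = f(t)/f(0) with f(t) = cos²(((t/T + s)/(1 + s)) · π/2) and offset s = 0.008. A minimal NumPy sketch for the 4000-timestep setting (the function name `cosine_alpha_bar` is ours):

```python
import numpy as np

def cosine_alpha_bar(T=4000, s=0.008):
    """Cosine noise schedule of Nichol & Dhariwal (2021):
    alpha_bar(t) = f(t) / f(0),  f(t) = cos^2(((t/T + s) / (1 + s)) * pi / 2).
    Returns alpha_bar evaluated at t = 0, 1, ..., T."""
    t = np.arange(T + 1)
    f = np.cos(((t / T + s) / (1 + s)) * np.pi / 2) ** 2
    return f / f[0]  # normalize so alpha_bar(0) = 1

alpha_bar = cosine_alpha_bar()
# alpha_bar decreases monotonically from 1 toward 0 across the 4000 timesteps,
# adding noise more gradually near t = 0 than a linear schedule does.
```

This is the ᾱt that enters the forward-process equations of Algorithm 1, e.g. zt = √ᾱt z0 + √(1 − ᾱt) ϵ.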