Cascaded Diffusion Models for High Fidelity Image Generation
Authors: Jonathan Ho, Chitwan Saharia, William Chan, David J. Fleet, Mohammad Norouzi, Tim Salimans
JMLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 4. Experiments We designed experiments to improve the sample quality metrics of cascaded diffusion models on class-conditional ImageNet generation. Our cascading pipelines consist of class-conditional diffusion models at all resolutions, so class information is injected at all resolutions: see Fig. 4. Our final ImageNet results are described in Section 4.1. To give insight into our cascading pipelines, we begin with improvements on a baseline non-cascaded model at the 64×64 resolution (Section 4.2), then we show that cascading up to 64×64 improves upon our best non-cascaded 64×64 model, but only in conjunction with conditioning augmentation. We also show that truncated and non-truncated conditioning augmentation perform equally well (Section 4.3), and we study random Gaussian blur augmentation to train super-resolution models to resolutions of 128×128 and 256×256 (Section 4.4). Finally, we verify that conditioning augmentation is also effective on the LSUN dataset (Yu et al., 2015) and therefore is not specific to ImageNet (Section 4.5). |
| Researcher Affiliation | Industry | Jonathan Ho EMAIL Chitwan Saharia EMAIL William Chan EMAIL David J. Fleet EMAIL Mohammad Norouzi EMAIL Tim Salimans EMAIL Google, 1600 Amphitheatre Parkway, Mountain View, CA 94043 |
| Pseudocode | Yes | Algorithm 1 (Training a two-stage CDM with Gaussian conditioning augmentation): 1: repeat ▷ Train base model 2: (z0, c) ∼ p(z, c) ▷ Sample low-resolution image and label 3: t ∼ U({1, . . . , T}) 4: ϵ ∼ N(0, I) 5: zt = √αt·z0 + √(1−αt)·ϵ 6: θ ← θ − η∇θ∥ϵθ(zt, t, c) − ϵ∥² ▷ Simple loss (can be replaced with a hybrid loss) 7: until converged 8: repeat ▷ Train super-resolution model (in parallel with the base model) 9: (x0, z0, c) ∼ p(x, z, c) ▷ Sample low- and high-resolution images and label 10: s, t ∼ U({1, . . . , T}) 11: ϵz, ϵx ∼ N(0, I) ▷ Note: ϵz, ϵx should have the same shapes as z0, x0, respectively 12: zs = √αs·z0 + √(1−αs)·ϵz ▷ Apply Gaussian conditioning augmentation 13: xt = √αt·x0 + √(1−αt)·ϵx 14: θ ← θ − η∇θ∥ϵθ(xt, t, zs, s, c) − ϵx∥² 15: until converged. Algorithm 2 (Sampling from a two-stage CDM with Gaussian conditioning augmentation): Require: c: class label; Require: s: conditioning augmentation truncation time 1: zT ∼ N(0, I) 2: if using truncated conditioning augmentation then 3: for t = T, . . . , s+1 do 4: zt−1 ∼ pθ(zt−1 \| zt, c) 5: end for 6: else 7: for t = T, . . . , 1 do 8: zt−1 ∼ pθ(zt−1 \| zt, c) 9: end for 10: zs ∼ q(zs \| z0) ▷ Overwrite previously sampled value of zs 11: end if 12: xT ∼ N(0, I) 13: for t = T, . . . , 1 do 14: xt−1 ∼ pθ(xt−1 \| xt, zs, c) 15: end for 16: return x0 |
| Open Source Code | No | High resolution figures and additional supplementary material can be found at https://cascaded-diffusion.github.io/. (The paper mentions a website for supplementary material but does not explicitly state that source code is provided there.) |
| Open Datasets | Yes | Our key contribution is the use of cascades to improve the sample quality of diffusion models on class-conditional ImageNet. ... We cropped and resized the ImageNet dataset (Russakovsky et al., 2015) in the same manner as BigGAN (Brock et al., 2019). ... Finally, we verify that conditioning augmentation is also effective on the LSUN dataset (Yu et al., 2015) and therefore is not specific to ImageNet (Section 4.5). |
| Dataset Splits | Yes | We cropped and resized the ImageNet dataset (Russakovsky et al., 2015) in the same manner as BigGAN (Brock et al., 2019). We report Inception scores using the standard practice of generating 50k samples and calculating the mean and standard deviation over 10 splits (Salimans et al., 2016). Generally, throughout our experiments, we selected models and performed early stopping based on FID score calculated over 10k samples, but all reported FID scores are calculated over 50k samples for comparison with other work (Heusel et al., 2017). The FID scores we used for model selection and reporting model performance are calculated against training set statistics according to common practice, but since this can be seen as overfitting on the performance metric, we additionally report model performance using FID scores calculated against validation set statistics. ... (The relatively large FID scores between generated examples and the validation sets are explained by the fact that the LSUN Church and Bedroom validation sets are extremely small, consisting of only 300 examples each.) |
| Hardware Specification | Yes | Hardware: 256 TPU-v3 cores |
| Software Dependencies | No | The paper describes various model architectures, optimizers (e.g., Adam), and loss functions, but it does not specify concrete version numbers for software dependencies such as programming languages or libraries used for implementation. |
| Experiment Setup | Yes | Appendix B. Hyperparameters Here we give the hyperparameters of the models in our ImageNet cascading pipelines. Each model in the pipeline is described by its diffusion process, its neural network architecture, and its training hyperparameters. Architecture hyperparameters, such as the base channel count and the list of channel multipliers per resolution, refer to hyperparameters of the U-Net in DDPM and related models (Ho et al., 2020; Nichol and Dhariwal, 2021; Saharia et al., 2021; Salimans et al., 2017). The cosine noise schedule and the hybrid loss method of learning reverse process variances are from Improved DDPM (Nichol and Dhariwal, 2021). Some models are conditioned on αt for post-training sampler tuning (Chen et al., 2021; Saharia et al., 2021). 32×32 base model: Architecture — Base channels: 256; Channel multipliers: 1, 2, 3, 4; Residual blocks per resolution: 6; Attention resolutions: 8, 16; Attention heads: 4. Training — Optimizer: Adam; Batch size: 2048; Learning rate: 1e-4; Steps: 700000; Dropout: 0.1; EMA: 0.9999; Hardware: 256 TPU-v3 cores. Diffusion — Timesteps: 4000; Noise schedule: cosine; Learned variances: yes; Loss: hybrid |
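The training loops in Algorithm 1 above can be sketched concretely. The following is a minimal NumPy sketch, not the authors' implementation: the function names (`base_training_step`, `sr_training_step`, `eps_model`) and the `alpha_bar` helper are illustrative assumptions, the gradient update is omitted, and only the noising and simple-loss computation from the algorithm are shown. The cosine schedule follows the form popularized by Improved DDPM, which the paper cites.

```python
import numpy as np

def alpha_bar(t, T):
    """Cosine noise schedule (Improved DDPM style): cumulative signal
    level at timestep t, normalized so alpha_bar(0, T) = 1."""
    f = lambda u: np.cos((u / T + 0.008) / 1.008 * np.pi / 2) ** 2
    return f(t) / f(0)

def base_training_step(z0, c, eps_model, T=4000, rng=np.random):
    """One simple-loss step for the class-conditional base model:
    z_t = sqrt(a_t) z_0 + sqrt(1 - a_t) eps; loss = ||eps_theta - eps||^2."""
    t = rng.randint(1, T + 1)                 # t ~ U({1, ..., T})
    eps = rng.standard_normal(z0.shape)       # eps ~ N(0, I)
    a = alpha_bar(t, T)
    zt = np.sqrt(a) * z0 + np.sqrt(1.0 - a) * eps
    pred = eps_model(zt, t, c)                # eps_theta(z_t, t, c)
    return np.mean((pred - eps) ** 2)         # optimizer step omitted

def sr_training_step(x0, z0, c, eps_model, T=4000, rng=np.random):
    """Super-resolution step with Gaussian conditioning augmentation:
    the low-res conditioning image z_0 is noised to a random level s."""
    s, t = rng.randint(1, T + 1, size=2)
    eps_z = rng.standard_normal(z0.shape)
    eps_x = rng.standard_normal(x0.shape)
    a_s, a_t = alpha_bar(s, T), alpha_bar(t, T)
    zs = np.sqrt(a_s) * z0 + np.sqrt(1.0 - a_s) * eps_z   # augmented conditioning
    xt = np.sqrt(a_t) * x0 + np.sqrt(1.0 - a_t) * eps_x
    pred = eps_model(xt, t, zs, s, c)         # model sees (zs, s), not clean z0
    return np.mean((pred - eps_x) ** 2)
```

Note how the super-resolution model is conditioned on the noise level `s` alongside the noised low-res image `zs`, which is what lets the sampler pick a truncation time at test time.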
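Algorithm 2's two sampling paths (truncated vs. non-truncated conditioning augmentation) can likewise be sketched. This is a hypothetical skeleton, assuming `base_step` and `sr_step` each draw one reverse-process sample from p_theta; the linear `alpha_bar` default and all shapes are illustrative placeholders, not values from the paper.

```python
import numpy as np

def sample_two_stage(base_step, sr_step, c, T=4000, s=100, truncated=True,
                     z_shape=(32, 32), x_shape=(64, 64),
                     alpha_bar=lambda t, T: max(1.0 - t / T, 0.0),
                     rng=np.random):
    """Two-stage CDM ancestral sampling with conditioning augmentation.

    Truncated: stop the base chain early at time s, so z is left at noise
    level s. Non-truncated: run the base chain to z_0, then re-noise it
    with q(z_s | z_0). Either way the super-resolution chain conditions
    on the noisy low-res sample z_s and its level s.
    """
    z = rng.standard_normal(z_shape)                     # z_T ~ N(0, I)
    if truncated:
        for t in range(T, s, -1):                        # t = T, ..., s+1
            z = base_step(z, t, c)
    else:
        for t in range(T, 0, -1):                        # t = T, ..., 1
            z = base_step(z, t, c)
        a = alpha_bar(s, T)                              # z_s ~ q(z_s | z_0)
        z = np.sqrt(a) * z + np.sqrt(1.0 - a) * rng.standard_normal(z_shape)
    x = rng.standard_normal(x_shape)                     # x_T ~ N(0, I)
    for t in range(T, 0, -1):
        x = sr_step(x, t, z, s, c)                       # p_theta(x_{t-1} | x_t, z_s, c)
    return x
```

The table's finding that the two branches perform equally well is convenient here: the truncated path simply skips `s` base-model steps, so it is the cheaper of the two.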