Cascaded Diffusion Models for High Fidelity Image Generation

Authors: Jonathan Ho, Chitwan Saharia, William Chan, David J. Fleet, Mohammad Norouzi, Tim Salimans

JMLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | 4. Experiments. We designed experiments to improve the sample quality metrics of cascaded diffusion models on class-conditional ImageNet generation. Our cascading pipelines consist of class-conditional diffusion models at all resolutions, so class information is injected at all resolutions: see Fig. 4. Our final ImageNet results are described in Section 4.1. To give insight into our cascading pipelines, we begin with improvements on a baseline non-cascaded model at the 64×64 resolution (Section 4.2), then we show that cascading up to 64×64 improves upon our best non-cascaded 64×64 model, but only in conjunction with conditioning augmentation. We also show that truncated and non-truncated conditioning augmentation perform equally well (Section 4.3), and we study random Gaussian blur augmentation to train super-resolution models to resolutions of 128×128 and 256×256 (Section 4.4). Finally, we verify that conditioning augmentation is also effective on the LSUN dataset (Yu et al., 2015) and therefore is not specific to ImageNet (Section 4.5).
Researcher Affiliation | Industry | Jonathan Ho EMAIL, Chitwan Saharia EMAIL, William Chan EMAIL, David J. Fleet EMAIL, Mohammad Norouzi EMAIL, Tim Salimans EMAIL. Google, 1600 Amphitheatre Parkway, Mountain View, CA 94043
Pseudocode | Yes |
Algorithm 1 Training a two-stage CDM with Gaussian conditioning augmentation
 1: repeat  ▷ Train base model
 2:   (z0, c) ∼ p(z, c)  ▷ Sample low-resolution image and label
 3:   t ∼ U({1, . . . , T})
 4:   ϵ ∼ N(0, I)
 5:   zt = √ᾱt z0 + √(1 − ᾱt) ϵ
 6:   θ ← θ − η ∇θ ‖ϵθ(zt, t, c) − ϵ‖²  ▷ Simple loss (can be replaced with a hybrid loss)
 7: until converged
 8: repeat  ▷ Train super-resolution model (in parallel with the base model)
 9:   (x0, z0, c) ∼ p(x, z, c)  ▷ Sample low- and high-resolution images and label
10:   s, t ∼ U({1, . . . , T})
11:   ϵz, ϵx ∼ N(0, I)  ▷ Note: ϵz, ϵx should have the same shapes as z0, x0, respectively
12:   zs = √ᾱs z0 + √(1 − ᾱs) ϵz  ▷ Apply Gaussian conditioning augmentation
13:   xt = √ᾱt x0 + √(1 − ᾱt) ϵx
14:   θ ← θ − η ∇θ ‖ϵθ(xt, t, zs, s, c) − ϵx‖²
15: until converged

Algorithm 2 Sampling from a two-stage CDM with Gaussian conditioning augmentation
Require: c: class label
Require: s: conditioning augmentation truncation time
 1: zT ∼ N(0, I)
 2: if using truncated conditioning augmentation then
 3:   for t = T, . . . , s + 1 do
 4:     zt−1 ∼ pθ(zt−1 | zt, c)
 5:   end for
 6: else
 7:   for t = T, . . . , 1 do
 8:     zt−1 ∼ pθ(zt−1 | zt, c)
 9:   end for
10:   zs ∼ q(zs | z0)  ▷ Overwrite previously sampled value of zs
11: end if
12: xT ∼ N(0, I)
13: for t = T, . . . , 1 do
14:   xt−1 ∼ pθ(xt−1 | xt, zs, c)
15: end for
16: return x0
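To make the data-preparation half of Algorithm 1 concrete, here is a minimal NumPy sketch of Gaussian conditioning augmentation for the super-resolution stage: the high-resolution target is noised to a random level t while the low-resolution conditioning image is independently noised to a random level s. The function names (`diffuse`, `sr_training_inputs`) and the linear ᾱ schedule are illustrative assumptions, not the paper's implementation; the neural network and gradient step are omitted.

```python
import numpy as np

def diffuse(x0, alpha_bar_t, eps):
    """Forward-process sample: x_t = sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * eps."""
    return np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps

def sr_training_inputs(x0, z0, alpha_bar, rng):
    """Build one training example for the super-resolution stage
    (Algorithm 1, steps 9-13): noise the high-res image x0 to level t
    and the low-res conditioning image z0 to an independent level s."""
    T = len(alpha_bar)
    s, t = rng.integers(0, T, size=2)       # s, t ~ U({1, ..., T}) (0-indexed here)
    eps_z = rng.standard_normal(z0.shape)   # noise shaped like z0
    eps_x = rng.standard_normal(x0.shape)   # noise shaped like x0
    z_s = diffuse(z0, alpha_bar[s], eps_z)  # augmented conditioning signal
    x_t = diffuse(x0, alpha_bar[t], eps_x)  # network input
    return x_t, t, z_s, s, eps_x            # eps_x is the regression target

# Toy usage with a simple linear alpha-bar schedule (illustrative only).
rng = np.random.default_rng(0)
alpha_bar = np.linspace(0.9999, 0.0001, 1000)
x0 = rng.standard_normal((64, 64, 3))   # high-res image stand-in
z0 = rng.standard_normal((32, 32, 3))   # low-res image stand-in
x_t, t, z_s, s, target = sr_training_inputs(x0, z0, alpha_bar, rng)
```

The model ϵθ(xt, t, zs, s, c) would then be trained to predict `target` from `(x_t, t, z_s, s)` plus the class label, exactly as in step 14 of Algorithm 1.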
Open Source Code | No | High resolution figures and additional supplementary material can be found at https://cascaded-diffusion.github.io/. (The paper mentions a website for supplementary material but does not explicitly state that source code is provided there.)
Open Datasets | Yes | Our key contribution is the use of cascades to improve the sample quality of diffusion models on class-conditional ImageNet. ... We cropped and resized the ImageNet dataset (Russakovsky et al., 2015) in the same manner as BigGAN (Brock et al., 2019). ... Finally, we verify that conditioning augmentation is also effective on the LSUN dataset (Yu et al., 2015) and therefore is not specific to ImageNet (Section 4.5).
Dataset Splits | Yes | We cropped and resized the ImageNet dataset (Russakovsky et al., 2015) in the same manner as BigGAN (Brock et al., 2019). We report Inception scores using the standard practice of generating 50k samples and calculating the mean and standard deviation over 10 splits (Salimans et al., 2016). Generally, throughout our experiments, we selected models and performed early stopping based on FID score calculated over 10k samples, but all reported FID scores are calculated over 50k samples for comparison with other work (Heusel et al., 2017). The FID scores we used for model selection and reporting model performance are calculated against training set statistics according to common practice, but since this can be seen as overfitting on the performance metric, we additionally report model performance using FID scores calculated against validation set statistics. ... (The relatively large FID scores between generated examples and the validation sets are explained by the fact that the LSUN Church and Bedroom validation sets are extremely small, consisting of only 300 examples each.)
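The 50k-sample, 10-split reporting convention above can be sketched as follows. The per-sample values here are synthetic stand-ins (the real Inception Score is computed per split from classifier logits), so only the splitting and mean/std aggregation are illustrated; the function name `score_over_splits` is an assumption.

```python
import numpy as np

def score_over_splits(per_sample_values, n_splits=10):
    """Report a metric as (mean, std) over equal splits of the samples,
    following the convention of Salimans et al. (2016): compute the
    metric per split, then aggregate across splits."""
    splits = np.array_split(np.asarray(per_sample_values), n_splits)
    split_scores = np.array([s.mean() for s in splits])  # stand-in per-split metric
    return split_scores.mean(), split_scores.std()

# 50k synthetic per-sample values standing in for real metric inputs.
values = np.random.default_rng(0).normal(loc=200.0, scale=5.0, size=50_000)
mean, std = score_over_splits(values)
```

Reporting both the mean and the across-split standard deviation gives a rough error bar on the sample-quality estimate.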
Hardware Specification | Yes | Hardware: 256 TPU-v3 cores
Software Dependencies | No | The paper describes various model architectures, optimizers (e.g., Adam), and loss functions, but it does not specify concrete version numbers for software dependencies such as programming languages or libraries used for implementation.
Experiment Setup | Yes | Appendix B. Hyperparameters. Here we give the hyperparameters of the models in our ImageNet cascading pipelines. Each model in the pipeline is described by its diffusion process, its neural network architecture, and its training hyperparameters. Architecture hyperparameters, such as the base channel count and the list of channel multipliers per resolution, refer to hyperparameters of the U-Net in DDPM and related models (Ho et al., 2020; Nichol and Dhariwal, 2021; Saharia et al., 2021; Salimans et al., 2017). The cosine noise schedule and the hybrid loss method of learning reverse process variances are from Improved DDPM (Nichol and Dhariwal, 2021). Some models are conditioned on αt for post-training sampler tuning (Chen et al., 2021; Saharia et al., 2021).
32×32 base model
Architecture: Base channels: 256; Channel multipliers: 1, 2, 3, 4; Residual blocks per resolution: 6; Attention resolutions: 8, 16; Attention heads: 4
Training: Optimizer: Adam; Batch size: 2048; Learning rate: 1e-4; Steps: 700000; Dropout: 0.1; EMA: 0.9999; Hardware: 256 TPU-v3 cores
Diffusion: Timesteps: 4000; Noise schedule: cosine; Learned variances: yes; Loss: hybrid
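The cosine noise schedule cited above has a simple closed form in Nichol and Dhariwal (2021): ᾱ(t) = f(t)/f(0) with f(t) = cos²(((t/T + s)/(1 + s)) · π/2) and offset s = 0.008. A minimal NumPy sketch for the 4000-timestep setting (the function name `cosine_alpha_bar` is ours):

```python
import numpy as np

def cosine_alpha_bar(T=4000, s=0.008):
    """Cosine noise schedule of Nichol & Dhariwal (2021):
    alpha_bar(t) = f(t) / f(0),  f(t) = cos^2(((t/T + s) / (1 + s)) * pi / 2).
    Returns alpha_bar evaluated at t = 0, 1, ..., T."""
    t = np.arange(T + 1)
    f = np.cos(((t / T + s) / (1 + s)) * np.pi / 2) ** 2
    return f / f[0]  # normalize so alpha_bar(0) = 1

alpha_bar = cosine_alpha_bar()
# alpha_bar decreases monotonically from 1 toward 0 across the 4000 timesteps,
# adding noise more gradually near t = 0 than a linear schedule does.
```

This is the ᾱt that enters the forward-process equations of Algorithm 1, e.g. zt = √ᾱt z0 + √(1 − ᾱt) ϵ.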