Boosting Latent Diffusion with Perceptual Objectives
Authors: Tariq Berrada, Pietro Astolfi, Melissa Hall, Marton Havasi, Yohann Benchetrit, Adriana Romero-Soriano, Karteek Alahari, Michal Drozdzal, Jakob Verbeek
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments with models trained on three datasets at 256 and 512 resolution show improved quantitative results, with boosts between 6% and 20% in FID, as well as qualitative improvements when using our perceptual loss. |
| Researcher Affiliation | Collaboration | 1 FAIR at Meta; 2 Univ. Grenoble Alpes, Inria, CNRS, Grenoble INP, LJK, France; 3 McGill University; 4 Mila, Quebec AI Institute; 5 Canada CIFAR AI Chair |
| Pseudocode | No | The paper describes methods and equations but does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain an explicit statement about providing source code or a link to a code repository. |
| Open Datasets | Yes | We conduct an extensive evaluation on three datasets of different scales and distributions: ImageNet-1k (Deng et al., 2009), CC12M (Changpinyo et al., 2021), and S320M: a large internal dataset of 320M stock images. |
| Dataset Splits | Yes | We evaluate metrics with respect to ImageNet-1k and, for models trained on CC12M and S320M, the validation set of CC12M. |
| Hardware Specification | No | The paper does not explicitly describe the specific hardware used for running its experiments (e.g., GPU models, CPU types, or memory specifications). |
| Software Dependencies | No | The paper mentions various models and algorithms used (e.g., DDPM-ϵ, DDIM, Florence-2) but does not provide specific version numbers for software dependencies or libraries. |
| Experiment Setup | Yes | Unless specified otherwise, we follow the DDPM-ϵ training paradigm (Ho et al., 2020), using the DDIM (Song et al., 2021) algorithm with 50 steps for sampling and a classifier-free guidance scale of λ = 2.0 (Ho & Salimans, 2021). Following Podell et al. (2024), we use a quadratic scheduler with β_start = 0.00085 and β_end = 0.012. ... we pre-train all models at 256 resolution on the dataset of interest for 600k iterations. We then enter a second phase of training, in which we optionally apply our perceptual loss, which lasts for 200k iterations for 256 resolution models and for 120k iterations for models at 512 resolution. ... we use a guidance scale of 1.5 for resolutions of 256 and 2.0 for resolutions of 512, which we also found to be optimal for our baseline models trained without LPL. |
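The experiment-setup row cites the quadratic noise scheduler of Podell et al. (2024) with β_start = 0.00085 and β_end = 0.012. A minimal sketch of that schedule is below, assuming the common "scaled-linear" construction (linear interpolation in √β followed by squaring) and a hypothetical choice of 1000 training timesteps, neither of which is stated in the excerpt:

```python
import numpy as np

def quadratic_beta_schedule(beta_start=0.00085, beta_end=0.012, num_steps=1000):
    """Quadratic (scaled-linear) DDPM beta schedule.

    Interpolates linearly between sqrt(beta_start) and sqrt(beta_end),
    then squares, so the betas grow quadratically over the timesteps.
    """
    betas = np.linspace(beta_start ** 0.5, beta_end ** 0.5, num_steps) ** 2
    # Cumulative product of (1 - beta_t): the \bar{alpha}_t terms that
    # DDPM training and DDIM sampling use to scale signal vs. noise.
    alphas_cumprod = np.cumprod(1.0 - betas)
    return betas, alphas_cumprod

betas, alphas_cumprod = quadratic_beta_schedule()
```

The squaring step front-loads small betas, which keeps early timesteps close to the clean latent while still reaching near-pure noise by the final step.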