Boosting Latent Diffusion with Perceptual Objectives

Authors: Tariq Berrada, Pietro Astolfi, Melissa Hall, Marton Havasi, Yohann Benchetrit, Adriana Romero-Soriano, Karteek Alahari, Michal Drozdzal, Jakob Verbeek

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments with models trained on three datasets at 256 and 512 resolution show improved quantitative results (with boosts between 6% and 20% in FID) as well as improved qualitative results when using our perceptual loss.
Researcher Affiliation | Collaboration | 1 FAIR at Meta; 2 Univ. Grenoble Alpes, Inria, CNRS, Grenoble INP, LJK, France; 3 McGill University; 4 Mila, Quebec AI Institute; 5 Canada CIFAR AI Chair
Pseudocode | No | The paper describes methods and equations but does not contain structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not contain an explicit statement about providing source code or a link to a code repository.
Open Datasets | Yes | We conduct an extensive evaluation on three datasets of different scales and distributions: ImageNet-1k (Deng et al., 2009), CC12M (Changpinyo et al., 2021), and S320M: a large internal dataset of 320M stock images.
Dataset Splits | Yes | We evaluate metrics with respect to ImageNet-1k and, for models trained on CC12M and S320M, the validation set of CC12M.
Hardware Specification | No | The paper does not explicitly describe the specific hardware used for running its experiments (e.g., GPU models, CPU types, or memory specifications).
Software Dependencies | No | The paper mentions various models and algorithms used (e.g., DDPM-ϵ, DDIM, Florence-2) but does not provide specific version numbers for software dependencies or libraries.
Experiment Setup | Yes | Unless specified otherwise, we follow the DDPM-ϵ training paradigm (Ho et al., 2020), using the DDIM (Song et al., 2021) algorithm with 50 steps for sampling and a classifier-free guidance scale of λ = 2.0 (Ho & Salimans, 2021). Following Podell et al. (2024), we use a quadratic scheduler with βstart = 0.00085 and βend = 0.012. ... we pre-train all models at 256 resolution on the dataset of interest for 600k iterations. We then enter a second phase of training, in which we optionally apply our perceptual loss, which lasts for 200k iterations for 256 resolution models and for 120k iterations for models at 512 resolution. ... we use a guidance scale of 1.5 for resolutions of 256 and 2.0 for resolutions of 512, which we also found to be optimal for our baseline models trained without LPL.