A Simple Approach to Unifying Diffusion-based Conditional Generation

Authors: Xirui Li, Charles Herrmann, Kelvin Chan, Yinxiao Li, Deqing Sun, Chao Ma, Ming-Hsuan Yang

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive results show that our single model can produce comparable results with specialized methods and better results than prior unified methods. We also demonstrate that multiple models can be effectively combined for multi-signal conditional generation. ... In Fig. 3, we show sample results generated by our Depth model on different tasks and compare them with a specific method for each task. ... We compare UniCon with other methods on conditional generation and depth estimation. ... We conduct a user study to compare UniCon models against their corresponding methods to evaluate the conditional generation performance. ... We ablate on training steps and data scale for depth conditional generation (Fig. 7).
Researcher Affiliation | Collaboration | Xirui Li (1), Charles Herrmann (2), Kelvin C.K. Chan (2), Yinxiao Li (2), Deqing Sun (2), Chao Ma (1), Ming-Hsuan Yang (2); (1) Shanghai Jiao Tong University, (2) Google DeepMind
Pseudocode | No | The paper describes the model architecture and training strategy using text and mathematical equations, but it does not include any clearly labeled pseudocode blocks or algorithms.
Open Source Code | No | Project webpage: https://lixirui142.github.io/unicon-diffusion/. This is a project webpage, not a specific code repository, and the paper does not explicitly state that the source code is released at this link.
Open Datasets | Yes | We train 4 UniCon models, Depth, Soft Edge, Human-Pose (Pose), Human Identity (ID) on SDv1-5. ... We train Depth, Soft Edge models on 16k images from Pascal VOC (Everingham et al., 2012) and Pose model on a subset with 9k human images. ... We train the ID model on 30k human face images from CelebA (Liu et al., 2015) ... We train the Appearance model on Panda-70M (Chen et al., 2024b). ... We evaluate on NYUv2 (Silberman et al., 2012) and ScanNet (Dai et al., 2017).
Dataset Splits | No | The paper mentions specific quantities of images used for training (e.g., "16k images from Pascal VOC", "9k human images", "30k human face images from CelebA") and evaluation (e.g., "6K 512×512 images conditioned on depth"). However, it does not provide explicit training/validation/test splits, by percentage or sample count, that would allow the data partitioning to be reproduced, nor does it explicitly state that standard splits of the evaluation benchmarks are used.
Hardware Specification | Yes | Our models are trained in about 13 hours on 2-4 Nvidia A800 80G GPUs. ... Training 20K steps costs about 13 hours on two NVIDIA A800 80G GPUs. ... The ID and Appearance models are trained for 20k steps with batch size 64 distributed on 4 NVIDIA A800 80G GPUs. ... The latency is tested on one A800 GPU.
Software Dependencies | No | The paper mentions using Stable Diffusion (Rombach et al., 2022) as a base model and the AdamW (Loshchilov, 2017) optimizer. It also states text prompts are generated by BLIP (Li et al., 2023; 2022). However, it does not provide specific version numbers for these or other software libraries, programming languages (e.g., Python), or frameworks (e.g., PyTorch, TensorFlow).
Experiment Setup | Yes | For all models, we use AdamW (Loshchilov, 2017) optimizer with learning rate 1e-4. The training images are resized to 512 resolution with random flipping and random cropping as data augmentation. The text prompts are generated by BLIP (Li et al., 2023; 2022) for datasets without captions. We drop out the text prompt input with a rate of 0.1 to maintain the classifier-free guidance (Ho & Salimans, 2022) ability. ... We use LoRA rank 64 for all adapters... We train Depth, Soft Edge models for 20K steps with batch size 32 and Pose model for 10K steps. ... The ID and Appearance models are trained for 20k steps with batch size 64... For conditional generation comparison in Tab. 1, we use the DDIM (Song et al., 2020a) scheduler with eta=1.0. We sample for 50 steps and use a classifier-free guidance scale of 7.5. For depth estimation in Tab. 2, we use the Euler Ancestral scheduler (Karras et al., 2022) to sample 20 steps.
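The setup quoted above samples with classifier-free guidance at scale 7.5 (and drops the text prompt 10% of the time during training to keep the unconditional branch usable). As a grounding note, classifier-free guidance combines the unconditional and conditional noise predictions at each sampling step by extrapolating from the former toward the latter. A minimal NumPy sketch of that combination rule; the toy arrays standing in for the UNet outputs are hypothetical, not values from the paper:

```python
import numpy as np

def cfg_combine(eps_uncond, eps_cond, guidance_scale=7.5):
    """Classifier-free guidance (Ho & Salimans, 2022): extrapolate from the
    unconditional noise prediction toward the conditional one, scaled by
    the guidance weight. A scale of 1.0 reduces to the conditional output."""
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

# Toy stand-ins for one step's UNet noise predictions (hypothetical values).
eps_uncond = np.zeros(4)
eps_cond = np.ones(4)

# Each entry becomes 0 + 7.5 * (1 - 0) = 7.5 with the paper's scale of 7.5.
guided = cfg_combine(eps_uncond, eps_cond, guidance_scale=7.5)
print(guided)
```

At guidance scale 1.0 the rule returns the conditional prediction unchanged, which is why the 0.1 prompt-dropout during training matters: it is what makes the unconditional prediction available at inference time.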