Diffusion Bridge Implicit Models

Authors: Kaiwen Zheng, Guande He, Jianfei Chen, Fan Bao, Jun Zhu

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental In this section, we show that DBIMs surpass the original sampling procedure of DDBMs by a large margin, in terms of both sample quality and sample efficiency. We also showcase DBIM's capabilities in latent-space encoding, reconstruction, and interpolation using deterministic sampling. All comparisons between DBIMs and DDBMs are conducted using identically trained models. For DDBMs, we employ their proposed hybrid sampler. We conduct experiments including (1) image-to-image translation tasks on Edges→Handbags (Isola et al., 2017) (64×64) and DIODE-Outdoor (Vasiljevic et al., 2019) (256×256), and (2) the image restoration task of inpainting on ImageNet (Deng et al., 2009) (256×256) with a 128×128 center mask. We report the Fréchet inception distance (FID) (Heusel et al., 2017) for all experiments, and additionally measure Inception Score (IS) (Barratt & Sharma, 2018), Learned Perceptual Image Patch Similarity (LPIPS) (Zhang et al., 2018), Mean Squared Error (MSE) (for image-to-image translation), and Classifier Accuracy (CA) (for image inpainting), following previous works (Liu et al., 2023b; Zhou et al., 2023).
Researcher Affiliation Collaboration Kaiwen Zheng¹˒², Guande He¹, Jianfei Chen¹, Fan Bao², Jun Zhu¹˒²˒³; ¹Dept. of Comp. Sci. & Tech., Institute for AI, BNRist Center, Tsinghua-Bosch Joint ML Center, Tsinghua University, Beijing, China; ²Shengshu Technology, Beijing; ³Pazhou Lab (Huangpu), Guangzhou, China. EMAIL; EMAIL; EMAIL; EMAIL
Pseudocode Yes Algorithm 1 DBIM (high-order)
Require: condition x_T, timesteps 0 ≤ t_0 < t_1 < … < t_{N−1} < t_N = T, data prediction model x_θ, booting noise ϵ ~ N(0, I), noise schedule a_t, b_t, c_t, λ_t = log(b_t/c_t), order o (2 or 3).
1: x̂_T ← x_θ(x_T, T, x_T)
2: x_{t_{N−1}} ← a_{t_{N−1}} x_T + b_{t_{N−1}} x̂_T + c_{t_{N−1}} ϵ
3: for i ← N−1 to 1 do
4:   s, t ← t_{i−1}, t_i;  h ← λ_s − λ_t
5:   x̂_t ← x_θ(x_t, t, x_T)
6:   if o = 2 or i = N−1 then
7:     u ← t_{i+1};  h_1 ← λ_t − λ_u
8:     Estimate x̂_t^{(1)} with Eqn. (62)
9:     Î ← e^{λ_s} [ (1 − e^{−h}) x̂_t + (h − 1 + e^{−h}) x̂_t^{(1)} ]
10:  else
11:    u_1, u_2 ← t_{i+1}, t_{i+2};  h_1 ← λ_t − λ_{u_1};  h_2 ← λ_{u_1} − λ_{u_2}
12:    Estimate x̂_t^{(1)}, x̂_t^{(2)} with Eqn. (64)
13:    Î ← e^{λ_s} [ (1 − e^{−h}) x̂_t + (h − 1 + e^{−h}) x̂_t^{(1)} + (h²/2 − h + 1 − e^{−h}) x̂_t^{(2)} ]
14:  end if
15:  x_s ← (c_s/c_t) x_t + (a_s − (c_s/c_t) a_t) x_T + c_s Î
16: end for
17: return x_{t_0}
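The second-order branch of Algorithm 1 can be sketched in NumPy. This is a minimal illustration, not the authors' implementation: the bridge schedules a_t, b_t, c_t and the `model` function are toy stand-ins (the paper uses trained networks and its own noise schedules), and the correction term x̂_t^{(1)} of Eqn. (62) is approximated here by a simple backward finite difference in λ, which may differ from the paper's exact estimator.

```python
import numpy as np

T = 1.0

# Toy bridge schedules -- placeholders, NOT the schedules used in the paper.
def a(t): return t / T                                  # weight on condition x_T
def b(t): return 1.0 - t / T                            # weight on predicted data
def c(t): return 0.3 * np.sqrt(t * (T - t) / T) + 1e-4  # noise scale

def lam(t):
    # lambda_t = log(b_t / c_t); clip near t = T where b_t -> 0 (toy hack).
    t = np.minimum(t, T - 1e-3)
    return np.log(b(t) / c(t))

def dbim_second_order(x_T, timesteps, model, rng):
    """Sketch of Algorithm 1 with order o = 2.
    `timesteps` holds t_0 < t_1 < ... < t_N = T;
    `model(x, t, x_T)` stands in for the data-prediction network x_theta."""
    N = len(timesteps) - 1
    x_hat_prev = model(x_T, T, x_T)                  # line 1: prediction at T
    t = timesteps[N - 1]
    eps = rng.standard_normal(x_T.shape)             # booting noise
    x = a(t) * x_T + b(t) * x_hat_prev + c(t) * eps  # line 2
    for i in range(N - 1, 0, -1):
        s, t = timesteps[i - 1], timesteps[i]
        h = lam(s) - lam(t)
        x_hat = model(x, t, x_T)                     # line 5
        u = timesteps[i + 1]
        h1 = lam(t) - lam(u)
        # Finite-difference estimate of d x_hat / d lambda
        # (our reading of Eqn. (62); hedged).
        d1 = (x_hat - x_hat_prev) / h1
        I = np.exp(lam(s)) * ((1 - np.exp(-h)) * x_hat
                              + (h - 1 + np.exp(-h)) * d1)
        r = c(s) / c(t)
        x = r * x + (a(s) - r * a(t)) * x_T + c(s) * I   # line 15
        x_hat_prev = x_hat
    return x
```

A quick sanity check on the reconstructed coefficients: if the predictor returns a constant x̂*, then c_s Î = (b_s − (c_s/c_t) b_t) x̂*, so each update maps a_t x_T + b_t x̂* + c_t ϵ exactly to a_s x_T + b_s x̂* + c_s ϵ.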
Open Source Code Yes Code is available at https://github.com/thu-ml/DiffusionBridge.
Open Datasets Yes Table 7: The used datasets, codes and their licenses.
Name | URL | Citation | License
Edges→Handbags | https://github.com/junyanz/pytorch-CycleGAN-and-pix2pix | Isola et al. (2017) | BSD
DIODE-Outdoor | https://diode-dataset.org/ | Vasiljevic et al. (2019) | MIT
ImageNet | https://www.image-net.org | Deng et al. (2009) | \
Dataset Splits Yes We conduct experiments including (1) image-to-image translation tasks on Edges→Handbags (Isola et al., 2017) (64×64) and DIODE-Outdoor (Vasiljevic et al., 2019) (256×256), and (2) the image restoration task of inpainting on ImageNet (Deng et al., 2009) (256×256) with a 128×128 center mask. The metrics are computed using the complete training set for Edges→Handbags and DIODE-Outdoor, and 10k images from the validation set for ImageNet.
Hardware Specification Yes We train the model on 8 NVIDIA A800 GPU cards with a batch size of 256 for 400k iterations, which takes around 19 days. Table 8 shows the inference time of DBIM and previous methods on a single NVIDIA A100 under different settings.
Software Dependencies No The paper does not explicitly mention specific software dependencies with version numbers.
Experiment Setup Yes For the image inpainting task on ImageNet 256×256 with a 128×128 center mask, DDBMs do not provide available checkpoints. Therefore, we train a new model from scratch using the noise schedule of I2SB (Liu et al., 2023b). The network is initialized from the pretrained class-conditional diffusion model on ImageNet 256×256 provided by Dhariwal & Nichol (2021), while additionally conditioned on x_T. The data prediction model in this case is parameterized by the network F_θ as x_θ(x_t, t, x_T) = x_t − σ_t F_θ(x_t, t, x_T) and trained by minimizing the loss L(θ) = E_{t, x_0, x_T}[(1/σ_t²) ‖x_θ(x_t, t, x_T) − x_0‖²₂]. We train the model on 8 NVIDIA A800 GPU cards with a batch size of 256 for 400k iterations, which takes around 19 days. We elaborate on the sampling configurations of different approaches, including the choice of timesteps {t_i}_{i=0}^N and details of the samplers. In this work, we adopt t_min = 0.0001 and t_max = 1 following Zhou et al. (2023). For the DDBM baseline, we use the hybrid, high-order Heun sampler proposed in their work with an Euler step ratio of 0.33, which is the best-performing configuration for the image-to-image translation task. We use timesteps distributed according to EDM (Karras et al., 2022) scheduling, t_i = (t_max^{1/ρ} + (i/N)(t_min^{1/ρ} − t_max^{1/ρ}))^ρ, consistent with the official implementation of DDBM. For DBIM, since the initial sampling step is distinctly forced to be stochastic, we specifically set it to transition from t_max to t_max − 0.0001, and employ a simple uniformly distributed timestep scheme in [t_min, t_max − 0.0001) for the remaining timesteps, across all settings. For interpolation experiments, to enhance diversity, we increase the step size of the first step from 0.0001 to 0.01.
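The two timestep schemes described above can be sketched as follows. This is a minimal illustration under stated assumptions: ρ = 7 is EDM's default and is not stated in the text, and the function names are ours, not the paper's.

```python
import numpy as np

def edm_timesteps(N, t_min=1e-4, t_max=1.0, rho=7.0):
    """EDM-style schedule t_i = (t_max^(1/rho) + (i/N)(t_min^(1/rho)
    - t_max^(1/rho)))^rho for i = 0..N, decreasing from t_max to t_min
    (the reverse of the t_0 < ... < t_N indexing in Algorithm 1).
    rho = 7 is EDM's default, assumed here."""
    i = np.arange(N + 1)
    return (t_max ** (1 / rho)
            + (i / N) * (t_min ** (1 / rho) - t_max ** (1 / rho))) ** rho

def dbim_timesteps(N, t_min=1e-4, t_max=1.0, first_gap=1e-4):
    """DBIM scheme: one forced (stochastic) first step from t_max to
    t_max - first_gap, then uniform spacing down to t_min."""
    inner = np.linspace(t_max - first_gap, t_min, N)
    return np.concatenate([[t_max], inner])
```

For the interpolation setting mentioned above, the larger first step corresponds to `first_gap=0.01` in this sketch.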