Generalized Consistency Trajectory Models for Image Manipulation
Authors: Beomsu Kim, Jaemin Kim, Jeongsol Kim, Jong Chul Ye
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the potential of GCTMs on unconditional generation, image-to-image translation, image restoration, image editing, and latent manipulation. We show that GCTMs achieve competitive performance even with NFE = 1. (Section 5, Experiments) We now explore the possibilities of GCTMs on unconditional generation, image-to-image translation, image restoration, image editing, and latent manipulation. |
| Researcher Affiliation | Academia | Beomsu Kim, Jaemin Kim, Jeongsol Kim & Jong Chul Ye, KAIST |
| Pseudocode | Yes | Algorithm 1 q(x0, x1) Sampling Algorithm 2 GCTM Training Algorithm 3 Sinkhorn-Knopp (SK) Algorithm 4 Zero-shot Image Restoration Algorithm 5 Image Editing |
| Open Source Code | Yes | Code is available at https://github.com/1202kbs/GCTM. Reproducibility. We open-source our code at https://github.com/1202kbs/GCTM including training code for unconditional generation, image-to-image translation, and supervised image restoration models. |
| Open Datasets | Yes | In the unconditional generation task, we compare our GCTM generation performance using the CIFAR10 training dataset. In the image-to-image translation task, we evaluate the performance of models using the test sets of Edges→Shoes, Night→Day, and Facades from Pix2Pix. In the image restoration task, we use FFHQ and apply the following corruption operators H from I2SB to obtain measurements: bicubic super-resolution with a factor of 2, Gaussian deblurring with σ = 0.8, and center inpainting with Gaussian. In Table 6, we demonstrate the image restoration task of GCTM on ImageNet at higher resolution (256×256). |
| Dataset Splits | Yes | In the unconditional generation task, we compare our GCTM generation performance using the CIFAR10 training dataset. In the image-to-image translation task, we evaluate the performance of models using the test sets of Edges→Shoes, Night→Day, and Facades from Pix2Pix. In the image restoration task, we use FFHQ and apply the following corruption operators H from I2SB to obtain measurements: bicubic super-resolution with a factor of 2, Gaussian deblurring with σ = 0.8, and center inpainting with Gaussian. We then assess model performance using the test dataset, sampling 5,000 test images to obtain FID scores and the other metrics. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for the experiments. It only mentions general experimental settings. |
| Software Dependencies | No | The paper mentions the use of the 'Adam optimizer (Kingma & Ba, 2015)' and several tools for metrics calculation, such as 'pytorch-fid', 'torchvision/models/inception.py', 'Perceptual Similarity with AlexNet version 0.1', and 'scikit-image'. However, it does not specify version numbers for Python, PyTorch, CUDA, or any other primary libraries/frameworks used for implementation, except for 'AlexNet version 0.1', which refers to a model version rather than a software library version. |
| Experiment Setup | Yes | A.1 TRAINING. Bootstrapping scores. In all our experiments, we train GCTMs without a pre-trained score model. So, analogous to CTMs, we use velocity estimates given by an exponential moving average θ_EMA of θ to solve ODEs. We use exponential moving average decay rate 0.999. Time discretization. In practice, we discretize the unit interval into a finite number of timesteps {t_n}_{n=0}^{N} where t_0 = 0 < t_1 < ... < t_N = 1 (Eq. 24) and learn ODE trajectories integrated with respect to the discretization schedule. EDM (Karras et al., 2022), which has shown robust performance on a variety of generation tasks, solves the PFODE on the time interval (σ_min, σ_max) for 0 < σ_min < σ_max according to the discretization schedule σ_n = (σ_min^(1/ρ) + (n/N)(σ_max^(1/ρ) − σ_min^(1/ρ)))^ρ (Eq. 25) for n = 0, ..., N and ρ = 7. Thus, using the change of time variable (Eq. 17) derived in Theorem 1, we convert the PFODE EDM schedule to the FM ODE discretization t_0 = 0, t_n = σ_n/(1 + σ_n) for n = 1, ..., N−1, t_N = 1 (Eq. 26). In our experiments, we fix σ_min = 0.002 and control σ_max. We note that σ_max controls the amount of emphasis on time near t = 1, i.e., larger σ_max places more time discretization points near t = 1. Number of discretization steps N. CTMs use fixed N = 18. In contrast, analogous to iCMs, we double N every 100k iterations, starting from N = 4. Time t distribution. For unconditional generation, we sample t = σ/(1 + σ), log σ ~ N(−1.2, 1.2²) (Eq. 27), in accordance with EDM. For image-to-image translation, we sample t ~ Beta(3, 1) (Eq. 28). Network conditioning. We use the EDM conditioning, following CTMs. Distance d. CTMs use d defined as d(x_t, x̂_t) = LPIPS(G_{θ_EMA}(x_t, t, 0), G_{θ_EMA}(x̂_t, t, 0)) (Eq. 29), which compares the perceptual distance of samples projected to time t = 0. In contrast, following iCMs, we use the pseudo-Huber loss d(x_t, x̂_t) = sqrt(‖x_t − x̂_t‖₂² + c²) − c (Eq. 30), where c = 0.00054√d and d is the dimension of x_t. Batch size. We use batch size 128 for 32×32 resolution images and batch size 64 for 64×64 resolution images. Optimizer. We use the Adam optimizer (Kingma & Ba, 2015) with learning rate η = 0.0002 / (128 / batch size) (Eq. 31) and default (β₁, β₂) = (0.9, 0.999). Coefficient for L_FM(θ). We use λ_FM = 0.1 for all experiments. Network. We modify the SongUNet provided at https://github.com/NVlabs/edm to accept two time conditions t and s by using two time embedding layers. ODE Solver. We use the second-order Heun solver to calculate L_GCTM(θ). Gaussian perturbation. We apply a Gaussian perturbation from a normal distribution multiplied by 0.05 to sample x_1, except in the inpainting task. |
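The EDM-to-FM time discretization quoted in the Experiment Setup row (Eqs. 25–26) can be sketched in a few lines. This is an illustrative reconstruction from the quoted formulas, not the authors' code; the function names are ours, and the σ_max default is only a common EDM choice.

```python
import numpy as np

def edm_sigma_schedule(n_steps, sigma_min=0.002, sigma_max=80.0, rho=7.0):
    """Eq. 25: sigma_n = (sigma_min^(1/rho) + (n/N)(sigma_max^(1/rho) - sigma_min^(1/rho)))^rho."""
    n = np.arange(n_steps + 1)
    return (sigma_min ** (1 / rho)
            + (n / n_steps) * (sigma_max ** (1 / rho) - sigma_min ** (1 / rho))) ** rho

def fm_time_schedule(n_steps, sigma_min=0.002, sigma_max=80.0, rho=7.0):
    """Eq. 26: convert EDM sigmas to FM times via t = sigma / (1 + sigma),
    pinning the endpoints to t_0 = 0 and t_N = 1 as in the paper."""
    sigmas = edm_sigma_schedule(n_steps, sigma_min, sigma_max, rho)
    t = sigmas / (1.0 + sigmas)
    t[0], t[-1] = 0.0, 1.0
    return t
```

Consistent with the quoted remark, raising `sigma_max` pushes more of the resulting discretization points toward t = 1.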
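The pseudo-Huber distance quoted from the paper (Eq. 30) is simple to write down explicitly. A minimal sketch, assuming batched inputs flattened per sample; the function name is ours, and the default c = 0.00054·√d follows the quoted iCM-style choice:

```python
import numpy as np

def pseudo_huber(x, x_hat, c=None):
    """Eq. 30: d(x, x_hat) = sqrt(||x - x_hat||_2^2 + c^2) - c, computed per sample.
    By default c = 0.00054 * sqrt(d), with d the dimension of one sample."""
    diff = (x - x_hat).reshape(x.shape[0], -1)  # flatten each sample to a vector
    if c is None:
        c = 0.00054 * np.sqrt(diff.shape[1])
    return np.sqrt(np.sum(diff ** 2, axis=1) + c ** 2) - c
```

The distance is zero iff the inputs match, behaves quadratically for small residuals, and approaches the plain L2 norm for large ones, which is what makes it a smooth, outlier-robust replacement for the LPIPS distance of Eq. 29.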
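The two training-time distributions quoted in the Experiment Setup row (Eqs. 27–28) can likewise be sketched; this is our reconstruction of the quoted sampling rules, not the released code:

```python
import numpy as np

def sample_t_unconditional(size, rng):
    """Eq. 27 (unconditional generation): log sigma ~ N(-1.2, 1.2^2), then t = sigma / (1 + sigma)."""
    sigma = np.exp(rng.normal(loc=-1.2, scale=1.2, size=size))
    return sigma / (1.0 + sigma)

def sample_t_translation(size, rng):
    """Eq. 28 (image-to-image translation): t ~ Beta(3, 1), which biases t toward 1."""
    return rng.beta(3.0, 1.0, size=size)
```

Both rules return times in the unit interval; the Beta(3, 1) choice concentrates training near t = 1, i.e., near the source domain of the translation task.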
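Finally, the bootstrapping EMA (decay 0.999) and the batch-size-dependent learning rate of Eq. 31 amount to two one-liners. A hedged sketch with hypothetical function names, treating parameters as a plain dict for illustration:

```python
def ema_update(theta_ema, theta, decay=0.999):
    """One EMA step on parameters: theta_ema <- decay * theta_ema + (1 - decay) * theta.
    theta_ema is then used to bootstrap velocity estimates when solving ODEs."""
    return {k: decay * theta_ema[k] + (1.0 - decay) * theta[k] for k in theta}

def learning_rate(batch_size, base_lr=0.0002, base_batch=128):
    """Eq. 31: eta = 0.0002 / (128 / batch_size), i.e. linear scaling with batch size."""
    return base_lr / (base_batch / batch_size)
```

So the 64-per-batch runs at 64×64 resolution use half the base learning rate of the 128-per-batch runs at 32×32.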