REG: Rectified Gradient Guidance for Conditional Diffusion Models
Authors: Zhengqi Gao, Kaiwen Zha, Tianyuan Zhang, Zihui Xue, Duane S Boning
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on 1D and 2D demonstrate that REG provides a better approximation to the optimal solution than prior guidance techniques, validating the proposed theoretical framework. Extensive experiments on class-conditional ImageNet and text-to-image generation tasks show that incorporating REG consistently improves FID and Inception/CLIP scores across various settings compared to its absence. Our code is publicly available at: https://github.com/zhengqigao/REG/. |
| Researcher Affiliation | Academia | 1Massachusetts Institute of Technology. 2University of Texas at Austin. Correspondence to: Zhengqi Gao <EMAIL>. |
| Pseudocode | No | The paper describes the methodology using mathematical equations and prose but does not include any explicitly labeled pseudocode blocks or algorithms. |
| Open Source Code | Yes | Our code is publicly available at: https://github.com/zhengqigao/REG/. |
| Open Datasets | Yes | Extensive experiments on class-conditional ImageNet and text-to-image generation tasks show that incorporating REG consistently improves FID and Inception/CLIP scores across various settings compared to its absence. We evaluate various resolutions, including 64×64, 256×256, and 512×512, using DiT (Peebles & Xie, 2023) and EDM2 (Karras et al., 2024b) as baseline models. We evaluate different CFG approaches with and without REG on the COCO-2017 dataset (with 5,000 generated images) using FID and CLIP score as the evaluation metrics. |
| Dataset Splits | Yes | For text-to-image generation, we randomly select one caption per image from the COCO-2017 validation dataset, creating 5,000 pairs of images and captions. In class-conditional ImageNet generation, we evaluate FID and IS metrics using 50,000 generated images, following the protocols outlined in DiT (Peebles & Xie, 2023) and EDM2 (Karras et al., 2024b). |
| Hardware Specification | Yes | We also note that similar inference-time gradient calculations have been explored in universal guidance (Bansal et al., 2024), albeit in a different context. Table 5 presents a summary of the runtime and memory overhead introduced by REG on a single NVIDIA A40 GPU under identical experimental settings, compared to the vanilla CFG approach. |
| Software Dependencies | No | We conduct our experiments using the open-source DiT (Peebles & Xie, 2023), EDM2 (Karras et al., 2024b), and Huggingface Diffusers codebases, modifying their source code to support various guidance techniques. Specific version numbers for these software dependencies are not provided. |
| Experiment Setup | Yes | In our noise prediction networks, we employ sinusoidal embeddings for time, class labels, and the coordinate input, each with a dimension of 128. These embeddings are concatenated and passed through an MLP with three hidden layers, each having 128 hidden units. We use 20 time steps with β linearly scheduled from 0.001 to 0.2 (i.e., αt is linear from α1 = 1 − 0.001 to α25 = 1 − 0.2 in the DDPM notation). The network is trained using AdamW with a learning rate of 0.001 for 200 epochs. In the 2D example, we design a two-class conditional image generation task inspired by (Pärnamaa, 2023). ... We use 25 time steps with β linearly scheduled from 0.001 to 0.2 (i.e., αt transitions linearly from α1 = 1 − 0.001 to α25 = 1 − 0.2 in the DDPM notation). The diffusion model is trained using the AdamW optimizer with a learning rate of 0.0001 for 200 epochs. |
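The noise schedule quoted above is a standard DDPM linear beta schedule. As a minimal sketch (an illustration under the quoted settings, not the authors' code), β runs linearly from 0.001 to 0.2 over T = 25 steps, α_t = 1 − β_t, and the cumulative products ᾱ_t are what sampling actually uses:

```python
import numpy as np

# Linear beta schedule from the quoted setup: beta_1 = 0.001, beta_T = 0.2, T = 25.
T = 25
betas = np.linspace(0.001, 0.2, T)   # beta_1 ... beta_T
alphas = 1.0 - betas                 # alpha_t = 1 - beta_t (DDPM notation)
alpha_bars = np.cumprod(alphas)      # cumulative products used in forward/reverse sampling

# Endpoints match the description: alpha_1 = 1 - 0.001, alpha_25 = 1 - 0.2.
print(alphas[0], alphas[-1])
```

The 1D example in the quote uses the same endpoints with 20 steps, i.e. only `T` changes.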