REG: Rectified Gradient Guidance for Conditional Diffusion Models

Authors: Zhengqi Gao, Kaiwen Zha, Tianyuan Zhang, Zihui Xue, Duane S. Boning

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on 1D and 2D demonstrate that REG provides a better approximation to the optimal solution than prior guidance techniques, validating the proposed theoretical framework. Extensive experiments on class-conditional ImageNet and text-to-image generation tasks show that incorporating REG consistently improves FID and Inception/CLIP scores across various settings compared to its absence. Our code is publicly available at: https://github.com/zhengqigao/REG/.
Researcher Affiliation | Academia | 1) Massachusetts Institute of Technology; 2) University of Texas at Austin. Correspondence to: Zhengqi Gao <EMAIL>.
Pseudocode | No | The paper describes the methodology using mathematical equations and prose but does not include any explicitly labeled pseudocode blocks or algorithms.
Open Source Code | Yes | Our code is publicly available at: https://github.com/zhengqigao/REG/.
Open Datasets | Yes | Extensive experiments on class-conditional ImageNet and text-to-image generation tasks show that incorporating REG consistently improves FID and Inception/CLIP scores across various settings compared to its absence. We evaluate various resolutions, including 64×64, 256×256, and 512×512, using DiT (Peebles & Xie, 2023) and EDM2 (Karras et al., 2024b) as baseline models. We evaluate different CFG approaches with and without REG on the COCO-2017 dataset (with 5,000 generated images) using FID and CLIP score as the evaluation metrics.
Dataset Splits | Yes | For text-to-image generation, we randomly select one caption per image from the COCO-2017 validation dataset, creating 5,000 pairs of images and captions. In class-conditional ImageNet generation, we evaluate FID and IS metrics using 50,000 generated images, following the protocols outlined in DiT (Peebles & Xie, 2023) and EDM2 (Karras et al., 2024b).
Hardware Specification | Yes | We also note that similar inference-time gradient calculations have also been explored in universal guidance (Bansal et al., 2024), albeit in a different context. Table 5 presents a summary of the runtime and memory overhead introduced by REG on a single NVIDIA A40 GPU under identical experimental settings, compared to the vanilla CFG approach.
Software Dependencies | No | We conduct our experiments using the open-source DiT (Peebles & Xie, 2023), EDM2 (Karras et al., 2024b), and Hugging Face Diffusers codebases, modifying their source code to support various guidance techniques. Specific version numbers for these software dependencies are not provided.
Experiment Setup | Yes | In our noise prediction networks, we employ sinusoidal embeddings for time, class labels, and the coordinate input, each with a dimension of 128. These embeddings are concatenated and passed through an MLP with three hidden layers, each having 128 hidden units. We use 20 time steps with β linearly scheduled from 0.001 to 0.2 (i.e., αt is linear from α1 = 1 − 0.001 to α20 = 1 − 0.2 in the DDPM notation). The network is trained using AdamW with a learning rate of 0.001 for 200 epochs. In the 2D example, we design a two-class conditional image generation task inspired by (Parnamaa, 2023). ... We use 25 time steps with β linearly scheduled from 0.001 to 0.2 (i.e., αt transitions linearly from α1 = 1 − 0.001 to α25 = 1 − 0.2 in the DDPM notation). The diffusion model is trained using the AdamW optimizer with a learning rate of 0.0001 for 200 epochs.
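The dataset-splits row quotes a protocol of sampling one caption per image from the COCO-2017 validation set. The paper's own code is linked above; as a minimal illustrative sketch only (not the authors' implementation), the sampling step could look like the following, where the `annotations` list mimics the structure of COCO's caption annotations (`image_id` plus `caption` per entry) with made-up toy data:

```python
import random

# Toy stand-in for COCO-style caption annotations; in practice these
# would be loaded from captions_val2017.json.
annotations = [
    {"image_id": 1, "caption": "a cat on a mat"},
    {"image_id": 1, "caption": "a cat resting indoors"},
    {"image_id": 2, "caption": "a red bicycle by a wall"},
]

def sample_one_caption_per_image(annotations, seed=0):
    """Randomly pick one caption per unique image (hypothetical helper)."""
    rng = random.Random(seed)
    by_image = {}
    for ann in annotations:
        by_image.setdefault(ann["image_id"], []).append(ann["caption"])
    # One (image_id, caption) pair per image, chosen uniformly at random.
    return {img: rng.choice(caps) for img, caps in by_image.items()}

pairs = sample_one_caption_per_image(annotations)
print(len(pairs))  # one pair per unique image
```

On the real validation set this yields the 5,000 image-caption pairs the review quotes, since COCO-2017 val contains 5,000 images.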