REG: Rectified Gradient Guidance for Conditional Diffusion Models
Authors: Zhengqi Gao, Kaiwen Zha, Tianyuan Zhang, Zihui Xue, Duane S Boning
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on 1D and 2D demonstrate that REG provides a better approximation to the optimal solution than prior guidance techniques, validating the proposed theoretical framework. Extensive experiments on class-conditional ImageNet and text-to-image generation tasks show that incorporating REG consistently improves FID and Inception/CLIP scores across various settings compared to its absence. Our code is publicly available at: https://github.com/zhengqigao/REG/. |
| Researcher Affiliation | Academia | 1Massachusetts Institute of Technology. 2University of Texas at Austin. Correspondence to: Zhengqi Gao <EMAIL>. |
| Pseudocode | No | The paper describes the methodology using mathematical equations and prose but does not include any explicitly labeled pseudocode blocks or algorithms. |
| Open Source Code | Yes | Our code is publicly available at: https://github.com/zhengqigao/REG/. |
| Open Datasets | Yes | Extensive experiments on class-conditional ImageNet and text-to-image generation tasks show that incorporating REG consistently improves FID and Inception/CLIP scores across various settings compared to its absence. We evaluate various resolutions, including 64×64, 256×256, and 512×512, using DiT (Peebles & Xie, 2023) and EDM2 (Karras et al., 2024b) as baseline models. We evaluate different CFG approaches with and without REG on the COCO-2017 dataset (with 5,000 generated images) using FID and CLIP score as the evaluation metrics. |
| Dataset Splits | Yes | For text-to-image generation, we randomly select one caption per image from the COCO-2017 validation dataset, creating 5,000 pairs of images and captions. In class-conditional ImageNet generation, we evaluate FID and IS metrics using 50,000 generated images, following the protocols outlined in DiT (Peebles & Xie, 2023) and EDM2 (Karras et al., 2024b). |
| Hardware Specification | Yes | We also note that similar inference-time gradient calculations have been explored in universal guidance (Bansal et al., 2024), albeit in a different context. Table 5 presents a summary of the runtime and memory overhead introduced by REG on a single NVIDIA A40 GPU under identical experimental settings, compared to the vanilla CFG approach. |
| Software Dependencies | No | We conduct our experiments using the open-source DiT (Peebles & Xie, 2023), EDM2 (Karras et al., 2024b), and Huggingface Diffusers codebases, modifying their source code to support various guidance techniques. Specific version numbers for these software dependencies are not provided. |
| Experiment Setup | Yes | In our noise prediction networks, we employ sinusoidal embeddings for time, class labels, and the coordinate input, each with a dimension of 128. These embeddings are concatenated and passed through an MLP with three hidden layers, each having 128 hidden units. We use 20 time steps with β linearly scheduled from 0.001 to 0.2 (i.e., αt is linear from α1 = 1 − 0.001 to α25 = 1 − 0.2 in the DDPM notation). The network is trained using AdamW with a learning rate of 0.001 for 200 epochs. In the 2D example, we design a two-class conditional image generation task inspired by (Pärnamaa, 2023). ... We use 25 time steps with β linearly scheduled from 0.001 to 0.2 (i.e., αt transitions linearly from α1 = 1 − 0.001 to α25 = 1 − 0.2 in the DDPM notation). The diffusion model is trained using the AdamW optimizer with a learning rate of 0.0001 for 200 epochs. |
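The noise schedule quoted above is a standard DDPM linear beta schedule. As a minimal sketch (an illustration under the quoted settings, not the authors' code), β runs linearly from 0.001 to 0.2 over T = 25 steps, α_t = 1 − β_t, and the cumulative products ᾱ_t are what sampling actually uses:

```python
import numpy as np

# Linear beta schedule from the quoted setup: beta_1 = 0.001, beta_T = 0.2, T = 25.
T = 25
betas = np.linspace(0.001, 0.2, T)   # beta_1 ... beta_T
alphas = 1.0 - betas                 # alpha_t = 1 - beta_t (DDPM notation)
alpha_bars = np.cumprod(alphas)      # cumulative products used in forward/reverse sampling

# Endpoints match the description: alpha_1 = 1 - 0.001, alpha_25 = 1 - 0.2.
print(alphas[0], alphas[-1])
```

The 1D example in the quote uses the same endpoints with 20 steps, i.e. only `T` changes.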