G2D2: Gradient-Guided Discrete Diffusion for Inverse Problem Solving
Authors: Naoki Murata, Chieh-Hsin Lai, Yuhta Takida, Toshimitsu Uesaka, Bac Nguyen, Stefano Ermon, Yuki Mitsufuji
TMLR 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To validate our proposed approach, we conduct comprehensive experiments comparing G2D2 to current methods using standard benchmark datasets. Our results demonstrate that G2D2 achieves comparable performance to continuous counterparts (within 0.02-0.05 LPIPS points) while reducing GPU memory usage by up to 77% (4.7 GiB vs 20.9 GiB for PSLD). We also explore the application of a discrete prior-based motion-data-generation model to solve an inverse problem, specifically path-conditioned generation, without requiring further training. The results of our study indicate that G2D2 shows promise in tackling various inverse problems by leveraging pre-trained discrete diffusion models. |
| Researcher Affiliation | Collaboration | 1Sony AI, 2Stanford University, 3Sony Group Corporation |
| Pseudocode | Yes | Algorithm 1 Gradient-Guided Discrete Diffusion, G2D2 |
| Open Source Code | Yes | Our code is available at https://github.com/sony/g2d2. |
| Open Datasets | Yes | Datasets Following previous studies, we use the ImageNet (Deng et al., 2009) and Flickr-Faces-HQ (FFHQ) (Karras et al., 2019) datasets. The images are 256×256. For comparison, we use a subset of 1000 images from each validation set. |
| Dataset Splits | Yes | For comparison, we use a subset of 1000 images from each validation set. For our experiments with the FFHQ dataset, we use the 1000 images (indexed 0, 1, . . . , 999) from the validation set. We evaluate using the HumanML3D test set with sparse conditioning (5 frames out of 196) on metrics including: FID, R-Precision, Diversity, Foot Skating ratio, Trajectory error (50 cm), Location error (50 cm), and Average error. |
| Hardware Specification | Yes | All experiments are performed on one RTX 3090 (24 GiB), but G2D2 itself never exceeded 4.7 GiB of VRAM and required 194 s per ImageNet image (Table 5 in the Appendix). Hence the full pipeline can be executed on widely available 8 GiB cards. The measurements are conducted using a single NVIDIA A6000 GPU for the Gaussian deblurring task on ImageNet. |
| Software Dependencies | No | The implementation of G2D2 is based on the VQ-Diffusion model from the diffusers library. For the prior model, we use the pre-trained model available at https://huggingface.co/microsoft/vq-diffusion-ithq. In our experiments, the number of time steps T for sampling is set to 100. We use the clean-fid library for our evaluation. |
| Experiment Setup | Yes | The number of time steps T for sampling is set to 100. Parameterization of the Star-Shaped Noise Process: in G2D2, the star-shaped noise process follows the same cumulative transition probability $q(z_t \mid z_0)$ as the original Markov noise process. For the Markov forward process in which $q(z_t \mid z_{t-1})$ is defined using $Q_t$ as in Equation 2, the cumulative transition probability is computed as $q(z_{t,i} \mid z_0) = v^\top(z_{t,i})\,\overline{Q}_t\, v(z_{0,i})$, where $\overline{Q}_t = Q_t \cdots Q_1$. Here, $\overline{Q}_t$ can be computed in closed form as $\overline{Q}_t v(z_{0,i}) = \overline{\alpha}_t v(z_{0,i}) + (\overline{\gamma}_t - \overline{\beta}_t)\, v(K+1) + \overline{\beta}_t$ (Eq. 33), where $\overline{\alpha}_t = \prod_{i=1}^{t} \alpha_i$, $\overline{\gamma}_t = 1 - \prod_{i=1}^{t}(1 - \gamma_i)$, and $\overline{\beta}_t = (1 - \overline{\alpha}_t - \overline{\gamma}_t)/(K+1)$. These parameters can be calculated and stored in advance. The parameter settings follow those used during the training of the prior model: specifically, $\overline{\alpha}_1$ is set to 0.99999, $\overline{\alpha}_T$ to 0.000009, $\overline{\gamma}_1$ to 0.000009, and $\overline{\gamma}_T$ to 0.99999. For both $\overline{\alpha}_t$ and $\overline{\gamma}_t$, values are linearly interpolated between steps 1 and T. This scheduling results in a linear increase in the number of [MASK] states as t increases, ultimately leading to all variables transitioning to the [MASK] state. Additionally, the transition probability $\overline{\beta}_t$ between unmasked tokens is negligibly small, as $\overline{\alpha}_t$ and $\overline{\gamma}_t$ sum to nearly 1. The following hyperparameters are shared across all experiments: the number of optimization iterations is 30, the Gumbel-Softmax temperature is 1.0, and the forget coefficient is 0.3. For the classifier-free guidance scale, we use 5.0 in ImageNet experiments and 3.0 in FFHQ experiments. |
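The closed-form cumulative transition and its linear schedule, as described in the setup row above, can be sketched numerically. This is a minimal illustration under stated assumptions: the schedule endpoints and T = 100 are from the paper, while the codebook size `K` below is a small placeholder (the ITHQ VQ-VAE has its own codebook size), and `qbar_apply` is a hypothetical helper name.

```python
import numpy as np

T = 100  # number of sampling time steps (paper setting)
K = 16   # codebook size; illustrative placeholder, not the real ITHQ value

# Linear schedules between steps 1 and T with the endpoints given in the paper.
alpha_bar = np.linspace(0.99999, 0.000009, T)       # \bar{alpha}_t
gamma_bar = np.linspace(0.000009, 0.99999, T)       # \bar{gamma}_t ([MASK] mass)
beta_bar = (1.0 - alpha_bar - gamma_bar) / (K + 1)  # \bar{beta}_t, negligibly small

def qbar_apply(v_z0, t):
    """Closed form of Eq. (33): apply \bar{Q}_t to a one-hot token vector
    over K+1 categories, where index K is the [MASK] state."""
    v_mask = np.zeros(K + 1)
    v_mask[K] = 1.0
    return alpha_bar[t] * v_z0 + (gamma_bar[t] - beta_bar[t]) * v_mask + beta_bar[t]

# Example: token 0 at an early vs. a late step -- mass shifts onto [MASK].
v0 = np.zeros(K + 1)
v0[0] = 1.0
early = qbar_apply(v0, 0)
late = qbar_apply(v0, T - 1)
```

Because both schedules are linear with complementary endpoints, $\overline{\alpha}_t + \overline{\gamma}_t$ stays near 1 at every step, so $\overline{\beta}_t$ remains tiny and the resulting vectors are (up to that negligible slack) valid probability distributions.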
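The setup row also fixes the Gumbel-Softmax temperature at 1.0. A generic NumPy sketch of that relaxation is below; it illustrates the standard technique, not the paper's actual implementation, and the function name `gumbel_softmax` is our own.

```python
import numpy as np

def gumbel_softmax(logits, tau=1.0, rng=None):
    """Relaxed (differentiable) sample from a categorical distribution.

    tau is the temperature; the paper fixes tau = 1.0. As tau -> 0 the output
    approaches a one-hot sample; larger tau gives a smoother simplex point.
    """
    rng = rng if rng is not None else np.random.default_rng(0)
    # Gumbel(0, 1) noise via the inverse-CDF trick.
    g = -np.log(-np.log(rng.uniform(1e-12, 1.0, size=np.shape(logits))))
    y = (np.asarray(logits) + g) / tau
    y = y - y.max()          # stabilize the softmax
    e = np.exp(y)
    return e / e.sum()

# A relaxed sample over three categories with the paper's temperature.
sample = gumbel_softmax(np.log([0.7, 0.2, 0.1]), tau=1.0)
```

In G2D2 this relaxation is what makes the discrete token distribution amenable to gradient-based guidance, since the relaxed sample lies on the probability simplex rather than at a vertex.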