Aligning Generative Denoising with Discriminative Objectives Unleashes Diffusion for Visual Perception
Authors: Ziqi Pang, Xin Xu, Yu-Xiong Wang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To conclude, we have made the following contributions to align the generative denoising process in diffusion models for perception: ... Our insights are collectively named ADDP (Aligning Diffusion Denoising with Perception). Its enhancements generalize across diverse generative diffusion-based perception models, including the state-of-the-art diffusion-based depth estimator Marigold (Ke et al., 2024) and the generalist InstructCV (Gan et al., 2024). Our ADDP also extends the usability of diffusion-based perception to multi-modal referring image segmentation, where we enable a diffusion model to catch up with some discriminative baselines for the first time. 4 EXPERIMENTS |
| Researcher Affiliation | Academia | Ziqi Pang Xin Xu Yu-Xiong Wang University of Illinois Urbana-Champaign EMAIL |
| Pseudocode | Yes | Algorithm A: c_t^2 Estimation; Algorithm B: Contribution-aware Timestep Sampling |
| Open Source Code | Yes | Our code is available at https://github.com/ziqipang/ADDP. |
| Open Datasets | Yes | We strictly follow the setting in Marigold (Ke et al., 2024) for both training and evaluation. Concretely, the model is trained on the virtual depth maps in Hypersim (Roberts et al., 2021) and Virtual KITTI (Cabon et al., 2020) with initialization from Stable Diffusion 2 (Rombach et al., 2023). The evaluation is conducted in a zero-shot style on multiple real-world datasets, including NYUv2 (Silberman et al., 2012), ScanNet (Dai et al., 2017), DIODE (Vasiljevic et al., 2019), KITTI (Geiger et al., 2012), and ETH3D (Schops et al., 2017). We follow the standard practice of separately training models on RefCOCO (Yu et al., 2016), RefCOCO+ (Yu et al., 2016), and G-Ref (Nagaraja et al., 2016) (UMD split), which are created from the MSCOCO dataset (Lin et al., 2014). Specifically, we use NYUv2 (Silberman et al., 2012) for depth estimation, ADE20K (Zhou et al., 2017; 2019) for semantic segmentation, and COCO (Lin et al., 2014) for object detection. |
| Dataset Splits | Yes | We follow the standard practice of separately training models on RefCOCO (Yu et al., 2016), RefCOCO+ (Yu et al., 2016), and G-Ref (Nagaraja et al., 2016) (UMD split), then evaluate on their validation and test sets. We use N = 1,000 validation samples from RefCOCO (Yu et al., 2016) to estimate the results with IoU. We empirically reweight the ratio of training samples from each dataset to 0.3, 0.3, and 0.4, respectively. |
| Hardware Specification | No | This work used computational resources, including the NCSA Delta and DeltaAI supercomputers through allocations CIS230012 and CIS240387 from the Advanced Cyberinfrastructure Coordination Ecosystem: Services & Support (ACCESS) program, as well as the TACC Frontera supercomputer, Amazon Web Services (AWS), and the OpenAI API through the National Artificial Intelligence Research Resource (NAIRR) Pilot. |
| Software Dependencies | Yes | We first prompt LLaVA (Liu et al., 2023b;c) to create a detailed caption of each image, specifically the llava-v1.6-vicuna-13b model...Then we prompt GPT-4 (Achiam et al., 2023) to analyze the captions and referring expressions to name the confusing objects for referring segmentation...Specifically, we use gpt-4o-2024-05-13 to conduct the prompts in the middle of Fig. 5. |
| Experiment Setup | Yes | During training, we resize all the images to the resolution of 256×256, optimizing with the AdamW optimizer (Kingma, 2014; Loshchilov, 2017), batch size of 128, learning rate of 10⁻⁴, and cosine annealing scheduler (Loshchilov & Hutter, 2016). The classifier-free guidance (Ho & Salimans, 2022) weights are manually tuned on the RefCOCO validation set to guarantee optimal performance, which is 1.5 for image conditioning and 7.5 for text conditioning in the InstructPix2Pix model and 1.5 for image conditioning and 3.0 for text conditioning in our enhanced model. The models presented in Table 2 are trained with 60 epochs, where each epoch indicates enumerating each image once. The model is trained for 20 epochs, with 100k iterations per epoch. |
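The Experiment Setup row reports two separate classifier-free guidance weights, one for image conditioning and one for text conditioning. As a minimal sketch of how such dual weights can combine noise predictions, assuming the InstructPix2Pix-style dual guidance formulation (the function and argument names below are illustrative, not taken from the paper's code):

```python
def dual_cfg(eps_uncond, eps_img, eps_img_text, s_img=1.5, s_text=3.0):
    """Combine three noise predictions with separate image/text guidance
    weights, following the InstructPix2Pix-style dual classifier-free
    guidance rule (plain floats stand in for noise tensors here):
        e~ = e(0, 0) + s_I * (e(cI, 0) - e(0, 0))
                     + s_T * (e(cI, cT) - e(cI, 0))
    Defaults mirror the quoted weights of the enhanced model (1.5 / 3.0)."""
    return (eps_uncond
            + s_img * (eps_img - eps_uncond)
            + s_text * (eps_img_text - eps_img))

# With both weights at 1.0 the fully conditioned prediction is recovered:
print(dual_cfg(0.0, 1.0, 2.0, s_img=1.0, s_text=1.0))  # 2.0
```

Raising `s_text` relative to `s_img` pushes the sample further along the text-conditioning direction, which is why the two weights are tuned independently on the RefCOCO validation set.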