Aligning Generative Denoising with Discriminative Objectives Unleashes Diffusion for Visual Perception

Authors: Ziqi Pang, Xin Xu, Yu-Xiong Wang

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "To conclude, we have made the following contributions to align the generative denoising process in diffusion models for perception: ... Our insights are collectively named ADDP (Aligning Diffusion Denoising with Perception). Its enhancements generalize across diverse generative diffusion-based perception models, including the state-of-the-art diffusion-based depth estimator Marigold (Ke et al., 2024) and the generalist InstructCV (Gan et al., 2024). Our ADDP also extends the usability of diffusion-based perception to multi-modal referring image segmentation, where we enable a diffusion model to catch up with some discriminative baselines for the first time."
Researcher Affiliation | Academia | "Ziqi Pang, Xin Xu, Yu-Xiong Wang, University of Illinois Urbana-Champaign, EMAIL"
Pseudocode | Yes | "Algorithm A: c_t^2 Estimation; Algorithm B: Contribution-aware Timestep Sampling"
Open Source Code | Yes | "Our code is available at https://github.com/ziqipang/ADDP."
Open Datasets | Yes | "We strictly follow the setting in Marigold (Ke et al., 2024) for both training and evaluation. Concretely, the model is trained on the virtual depth maps in Hypersim (Roberts et al., 2021) and Virtual KITTI (Cabon et al., 2020) with initialization from Stable Diffusion 2 (Rombach et al., 2023). The evaluation is conducted in a zero-shot style on multiple real-world datasets, including NYUv2 (Silberman et al., 2012), ScanNet (Dai et al., 2017), DIODE (Vasiljevic et al., 2019), KITTI (Geiger et al., 2012), and ETH3D (Schops et al., 2017). We follow the standard practice of separately training models on RefCOCO (Yu et al., 2016), RefCOCO+ (Yu et al., 2016), and G-Ref (Nagaraja et al., 2016) (UMD split), which are created from the MSCOCO dataset (Lin et al., 2014). Specifically, we use NYUv2 (Silberman et al., 2012) for depth estimation, ADE20K (Zhou et al., 2017; 2019) for semantic segmentation, and COCO (Lin et al., 2014) for object detection."
Dataset Splits | Yes | "We follow the standard practice of separately training models on RefCOCO (Yu et al., 2016), RefCOCO+ (Yu et al., 2016), and G-Ref (Nagaraja et al., 2016) (UMD split), then evaluate on their validation and test sets. We use N = 1,000 validation samples from RefCOCO (Yu et al., 2016) to estimate the results with IoU. We empirically reweight the ratio of training samples from each dataset to 0.3, 0.3, and 0.4, respectively."
Hardware Specification | No | "This work used computational resources, including the NCSA Delta and DeltaAI supercomputers through allocations CIS230012 and CIS240387 from the Advanced Cyberinfrastructure Coordination Ecosystem: Services & Support (ACCESS) program, as well as the TACC Frontera supercomputer, Amazon Web Services (AWS), and the OpenAI API through the National Artificial Intelligence Research Resource (NAIRR) Pilot."
Software Dependencies | Yes | "We first prompt LLaVA (Liu et al., 2023b;c) to create a detailed caption of each image, specifically llava-v1.6-vicuna-13b... Then we prompt GPT-4 (Achiam et al., 2023) to analyze the captions and referring expressions to name the confusing objects for referring segmentation... Specifically, we use gpt-4o-2024-05-13 to conduct the prompts in the middle of Fig. 5."
Experiment Setup | Yes | "During training, we resize all the images to the resolution of 256×256, optimizing with the AdamW optimizer (Kingma, 2014; Loshchilov, 2017), batch size of 128, learning rate of 10^-4, and a cosine annealing scheduler (Loshchilov & Hutter, 2016). The classifier-free guidance (Ho & Salimans, 2022) weights are manually tuned on the RefCOCO validation set to guarantee optimal performance: 1.5 for image conditioning and 7.5 for text conditioning in the InstructPix2Pix model, and 1.5 for image conditioning and 3.0 for text conditioning in our enhanced model. The models presented in Table 2 are trained for 60 epochs, where each epoch indicates enumerating each image once. The model is trained for 20 epochs, with 100k iterations per epoch."
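The "Contribution-aware Timestep Sampling" named in Algorithm B is only cited here, not reproduced. As a minimal sketch of the general idea — sampling diffusion timesteps in proportion to an estimated per-timestep contribution score — the function name, the shape of the contribution array, and the linear weighting below are all hypothetical placeholders, not the paper's actual estimates:

```python
import numpy as np

def sample_timesteps(contribution, batch_size, rng=None):
    """Sample timestep indices with probability proportional to a
    per-timestep contribution score (hypothetical array of shape [T])."""
    rng = np.random.default_rng() if rng is None else rng
    probs = contribution / contribution.sum()  # normalize to a distribution
    return rng.choice(len(contribution), size=batch_size, p=probs)

# Hypothetical usage: weight later timesteps slightly more than earlier ones.
T = 1000
contribution = np.linspace(1.0, 2.0, T)  # stand-in contribution estimates
steps = sample_timesteps(contribution, batch_size=128)
print(steps.shape)  # (128,)
```

This replaces the uniform timestep draw of standard diffusion training with a weighted draw; everything else in the training loop stays unchanged.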
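The two classifier-free guidance weights quoted above (image and text conditioning) suggest the two-condition combination introduced by InstructPix2Pix. A sketch of that combination rule, assuming the three noise predictions (unconditional, image-only, image+text) are already computed — the arrays and defaults here only illustrate the excerpt's "enhanced model" weights of 1.5 and 3.0:

```python
import numpy as np

def dual_cfg(eps_uncond, eps_img, eps_full, s_img=1.5, s_txt=3.0):
    """Combine three noise predictions under two-condition classifier-free
    guidance: unconditional, image-conditioned, and image+text-conditioned."""
    return (eps_uncond
            + s_img * (eps_img - eps_uncond)
            + s_txt * (eps_full - eps_img))

# Sanity check: with both scales at 1, the combination collapses to the
# fully conditioned prediction.
e0, e1, e2 = np.zeros(4), np.ones(4), np.full(4, 2.0)
print(dual_cfg(e0, e1, e2, s_img=1.0, s_txt=1.0))  # [2. 2. 2. 2.]
```

Raising `s_txt` above 1 pushes the sample toward the text condition, which is why the image and text scales are tuned separately on the RefCOCO validation set.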
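The 0.3/0.3/0.4 reweighting of training samples across RefCOCO, RefCOCO+, and G-Ref quoted in the Dataset Splits row can be realized with a simple mixture sampler. A sketch under assumed dataset sizes (the sizes and function names are illustrative, not the paper's):

```python
import numpy as np

def make_sampler(dataset_sizes, ratios, rng=None):
    """Return a draw(n) function yielding (dataset_index, sample_index)
    pairs so each dataset contributes according to the mixture ratios."""
    rng = np.random.default_rng() if rng is None else rng
    ratios = np.asarray(ratios, dtype=float)
    ratios = ratios / ratios.sum()  # normalize in case ratios don't sum to 1

    def draw(n):
        ds = rng.choice(len(dataset_sizes), size=n, p=ratios)
        idx = np.array([rng.integers(dataset_sizes[d]) for d in ds])
        return ds, idx

    return draw

# Hypothetical sizes for RefCOCO, RefCOCO+, G-Ref; ratios from the excerpt.
draw = make_sampler([120_000, 120_000, 80_000], [0.3, 0.3, 0.4])
ds, idx = draw(10_000)
print(np.bincount(ds, minlength=3) / len(ds))  # roughly [0.3, 0.3, 0.4]
```

Sampling dataset membership first and an example index second keeps the mixture ratio independent of the raw dataset sizes, which is the point of the reweighting.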