Aligning Generative Denoising with Discriminative Objectives Unleashes Diffusion for Visual Perception

Authors: Ziqi Pang, Xin Xu, Yu-Xiong Wang

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "To conclude, we have made the following contributions to align the generative denoising process in diffusion models for perception: ... Our insights are collectively named ADDP (Aligning Diffusion Denoising with Perception). Its enhancements generalize across diverse generative diffusion-based perception models, including the state-of-the-art diffusion-based depth estimator Marigold (Ke et al., 2024) and the generalist InstructCV (Gan et al., 2024). Our ADDP also extends the usability of diffusion-based perception to multi-modal referring image segmentation, where we enable a diffusion model to catch up with some discriminative baselines for the first time."
Researcher Affiliation | Academia | "Ziqi Pang, Xin Xu, Yu-Xiong Wang, University of Illinois Urbana-Champaign, EMAIL"
Pseudocode | Yes | "Algorithm A: c_t^2 Estimation; Algorithm B: Contribution-aware Timestep Sampling"
Open Source Code | Yes | "Our code is available at https://github.com/ziqipang/ADDP."
Open Datasets | Yes | "We strictly follow the setting in Marigold (Ke et al., 2024) for both training and evaluation. Concretely, the model is trained on the virtual depth maps in Hypersim (Roberts et al., 2021) and Virtual KITTI (Cabon et al., 2020) with initialization from Stable Diffusion 2 (Rombach et al., 2023). The evaluation is conducted in a zero-shot style on multiple real-world datasets, including NYUv2 (Silberman et al., 2012), ScanNet (Dai et al., 2017), DIODE (Vasiljevic et al., 2019), KITTI (Geiger et al., 2012), and ETH3D (Schops et al., 2017). We follow the standard practice of separately training models on RefCOCO (Yu et al., 2016), RefCOCO+ (Yu et al., 2016), and G-Ref (Nagaraja et al., 2016) (UMD split), which are created from the MSCOCO dataset (Lin et al., 2014). Specifically, we use NYUv2 (Silberman et al., 2012) for depth estimation, ADE20K (Zhou et al., 2017; 2019) for semantic segmentation, and COCO (Lin et al., 2014) for object detection."
Dataset Splits | Yes | "We follow the standard practice of separately training models on RefCOCO (Yu et al., 2016), RefCOCO+ (Yu et al., 2016), and G-Ref (Nagaraja et al., 2016) (UMD split), then evaluate on their validation and test sets. We use N = 1,000 validation samples from RefCOCO (Yu et al., 2016) to estimate the results with IoU. We empirically reweight the ratio of training samples from each dataset to 0.3, 0.3, and 0.4, respectively."
Hardware Specification | No | "This work used computational resources, including the NCSA Delta and DeltaAI supercomputers through allocations CIS230012 and CIS240387 from the Advanced Cyberinfrastructure Coordination Ecosystem: Services & Support (ACCESS) program, as well as the TACC Frontera supercomputer, Amazon Web Services (AWS), and the OpenAI API through the National Artificial Intelligence Research Resource (NAIRR) Pilot."
Software Dependencies | Yes | "We first prompt LLaVA (Liu et al., 2023b;c) to create a detailed caption of each image, specifically llava-v1.6-vicuna-13b... Then we prompt GPT-4 (Achiam et al., 2023) to analyze the captions and referring expressions to name the confusing objects for referring segmentation... Specifically, we use gpt-4o-2024-05-13 to conduct the prompts in the middle of Fig. 5."
Experiment Setup | Yes | "During training, we resize all the images to the resolution of 256×256, optimizing with the AdamW optimizer (Kingma, 2014; Loshchilov, 2017), batch size of 128, learning rate of 10^-4, and a cosine annealing scheduler (Loshchilov & Hutter, 2016). The classifier-free guidance (Ho & Salimans, 2022) weights are manually tuned on the RefCOCO validation set to guarantee optimal performance: 1.5 for image conditioning and 7.5 for text conditioning in the InstructPix2Pix model, and 1.5 for image conditioning and 3.0 for text conditioning in our enhanced model. The models presented in Table 2 are trained for 60 epochs, where each epoch indicates enumerating each image once. The model is trained for 20 epochs, with 100k iterations per epoch."
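The "Contribution-aware Timestep Sampling" named in Algorithm B is only cited here, not reproduced. As a minimal sketch of the general idea — sampling diffusion timesteps in proportion to an estimated per-timestep contribution score — the function name, the shape of the contribution array, and the linear weighting below are all hypothetical placeholders, not the paper's actual estimates:

```python
import numpy as np

def sample_timesteps(contribution, batch_size, rng=None):
    """Sample timestep indices with probability proportional to a
    per-timestep contribution score (hypothetical array of shape [T])."""
    rng = np.random.default_rng() if rng is None else rng
    probs = contribution / contribution.sum()  # normalize to a distribution
    return rng.choice(len(contribution), size=batch_size, p=probs)

# Hypothetical usage: weight later timesteps slightly more than earlier ones.
T = 1000
contribution = np.linspace(1.0, 2.0, T)  # stand-in contribution estimates
steps = sample_timesteps(contribution, batch_size=128)
print(steps.shape)  # (128,)
```

This replaces the uniform timestep draw of standard diffusion training with a weighted draw; everything else in the training loop stays unchanged.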
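The two classifier-free guidance weights quoted above (image and text conditioning) suggest the two-condition combination introduced by InstructPix2Pix. A sketch of that combination rule, assuming the three noise predictions (unconditional, image-only, image+text) are already computed — the arrays and defaults here only illustrate the excerpt's "enhanced model" weights of 1.5 and 3.0:

```python
import numpy as np

def dual_cfg(eps_uncond, eps_img, eps_full, s_img=1.5, s_txt=3.0):
    """Combine three noise predictions under two-condition classifier-free
    guidance: unconditional, image-conditioned, and image+text-conditioned."""
    return (eps_uncond
            + s_img * (eps_img - eps_uncond)
            + s_txt * (eps_full - eps_img))

# Sanity check: with both scales at 1, the combination collapses to the
# fully conditioned prediction.
e0, e1, e2 = np.zeros(4), np.ones(4), np.full(4, 2.0)
print(dual_cfg(e0, e1, e2, s_img=1.0, s_txt=1.0))  # [2. 2. 2. 2.]
```

Raising `s_txt` above 1 pushes the sample toward the text condition, which is why the image and text scales are tuned separately on the RefCOCO validation set.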
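The 0.3/0.3/0.4 reweighting of training samples across RefCOCO, RefCOCO+, and G-Ref quoted in the Dataset Splits row can be realized with a simple mixture sampler. A sketch under assumed dataset sizes (the sizes and function names are illustrative, not the paper's):

```python
import numpy as np

def make_sampler(dataset_sizes, ratios, rng=None):
    """Return a draw(n) function yielding (dataset_index, sample_index)
    pairs so each dataset contributes according to the mixture ratios."""
    rng = np.random.default_rng() if rng is None else rng
    ratios = np.asarray(ratios, dtype=float)
    ratios = ratios / ratios.sum()  # normalize in case ratios don't sum to 1

    def draw(n):
        ds = rng.choice(len(dataset_sizes), size=n, p=ratios)
        idx = np.array([rng.integers(dataset_sizes[d]) for d in ds])
        return ds, idx

    return draw

# Hypothetical sizes for RefCOCO, RefCOCO+, G-Ref; ratios from the excerpt.
draw = make_sampler([120_000, 120_000, 80_000], [0.3, 0.3, 0.4])
ds, idx = draw(10_000)
print(np.bincount(ds, minlength=3) / len(ds))  # roughly [0.3, 0.3, 0.4]
```

Sampling dataset membership first and an example index second keeps the mixture ratio independent of the raw dataset sizes, which is the point of the reweighting.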