Diff-2-in-1: Bridging Generation and Dense Perception with Diffusion Models

Authors: Shuhong Zheng, Zhipeng Bao, Ruoyu Zhao, Martial Hebert, Yu-Xiong Wang

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Comprehensive experimental evaluations validate the effectiveness of our framework, showcasing consistent performance improvements across various discriminative backbones and high-quality multi-modal data generation characterized by both realism and usefulness. Our project website is available at https://zsh2000.github.io/diff-2-in-1.github.io/. 5 EXPERIMENTAL EVALUATION 5.1 EVALUATION SETUP We first evaluate our proposed Diff-2-in-1 in the single-task settings with surface normal estimation and semantic segmentation as targets. Next, we apply Diff-2-in-1 in the multi-task settings of NYUD-MT (Silberman et al., 2012) and PASCAL-Context (Mottaghi et al., 2014) to show that it can provide universal benefit for more tasks simultaneously. Datasets and metrics. We evaluate surface normal estimation on the NYUv2 (Silberman et al., 2012; Ladicky et al., 2014) dataset. Different from previous methods that leverage additional raw data for training, we only use the 795 training samples. We include the number of training samples for each method in Table 1 for reference. Following Bae et al. (2021) and iDisc (Piccinelli et al., 2023), we adopt 11.25°, 22.5°, and 30° thresholds to measure the percentage of pixels with angle error lower than the corresponding thresholds. We also report the mean/median angle error and the root mean square error.
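The surface-normal metrics quoted above (mean/median angle error, RMSE, and the percentage of pixels under the 11.25°, 22.5°, and 30° thresholds) can be sketched as follows. This is an illustrative implementation, not code from the paper; the function name and the flattened `(N, 3)` array layout of per-pixel unit normals are assumptions.

```python
import numpy as np

def normal_metrics(pred, gt, thresholds=(11.25, 22.5, 30.0)):
    """Angular-error metrics for surface normal estimation.

    pred, gt: (N, 3) arrays of unit normal vectors, one per valid pixel.
    Returns mean/median angle error, RMSE (all in degrees), and the
    percentage of pixels whose error is below each angular threshold.
    """
    # Clip dot products to [-1, 1] to avoid NaN from floating-point drift.
    cos = np.clip(np.sum(pred * gt, axis=-1), -1.0, 1.0)
    err = np.degrees(np.arccos(cos))  # per-pixel angular error in degrees
    metrics = {
        "mean": err.mean(),
        "median": np.median(err),
        "rmse": np.sqrt((err ** 2).mean()),
    }
    for t in thresholds:
        metrics[f"<{t}"] = (err < t).mean() * 100.0  # percentage of pixels
    return metrics
```

In practice these metrics are computed only over pixels with valid ground-truth normals, so a validity mask would be applied before flattening.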
Researcher Affiliation Academia University of Illinois Urbana-Champaign; Carnegie Mellon University; Tsinghua University
Pseudocode No The paper describes methods and processes in text and figures but does not contain any explicitly labeled "Pseudocode" or "Algorithm" blocks, nor any structured code-like procedures.
Open Source Code Yes Our project website is available at https://zsh2000.github.io/diff-2-in-1.github.io/.
Open Datasets Yes Datasets and metrics. We evaluate surface normal estimation on the NYUv2 (Silberman et al., 2012; Ladicky et al., 2014) dataset. Semantic segmentation. We instantiate our Diff-2-in-1 on VPD (Zhao et al., 2023), a diffusion-based segmentation model. For self-improving, we synthesize one sample for each image in the training set. Multi-task evaluations. We apply our Diff-2-in-1 on two state-of-the-art multi-task methods, InvPT (Ye & Xu, 2022) and TaskPrompter (Ye & Xu, 2023). A total of 500 synthetic samples are generated for NYUD-MT following the surface normal evaluation. For PASCAL-Context, one sample is synthesized for each image in the training set with our Diff-2-in-1. The comparisons on NYUD-MT and PASCAL-Context are shown in Table 3 and Table 4, respectively. We train both the baseline Bae et al. (2021) and our Diff-2-in-1 on the ScanNet (Dai et al., 2017) dataset for the surface normal estimation task, and evaluate the performance on the test set of NYUv2 (Silberman et al., 2012; Ladicky et al., 2014).
Dataset Splits Yes We evaluate surface normal estimation on the NYUv2 (Silberman et al., 2012; Ladicky et al., 2014) dataset. Different from previous methods that leverage additional raw data for training, we only use the 795 training samples. We include the number of training samples for each method in Table 1 for reference. For multi-task evaluations, NYUD-MT spans across three tasks including semantic segmentation, monocular depth estimation, and surface normal estimation; PASCAL-Context takes semantic segmentation, human parsing, saliency detection, and surface normal estimation for evaluation. We run this ablation for semantic segmentation on the ADE20K dataset: we randomly select 10% (2K) to 90% (18K) samples with 10% (2K) intervals in between, assuming that Diff-2-in-1 only gets access to partial data.
Hardware Specification Yes This work used computational resources, including the NCSA Delta and Delta AI supercomputers through allocations CIS220014 and CIS230012 from the Advanced Cyberinfrastructure Coordination Ecosystem: Services & Support (ACCESS) program, as well as the TACC Frontera supercomputer and Amazon Web Services (AWS) through the National Artificial Intelligence Research Resource (NAIRR) Pilot.
Software Dependencies No The paper mentions several models and frameworks like "latent diffusion model (LDM) (Rombach et al., 2022)", "stable diffusion model (Rombach et al., 2022)", "U-Net (Ronneberger et al., 2015)", "Stable Diffusion v1-5", and "BLIP-2 (Li et al., 2023b)". However, it does not provide specific version numbers for ancillary software libraries or programming languages (e.g., Python, PyTorch, CUDA) required to replicate the experiments.
Experiment Setup Yes In the warm-up stage, we follow the same hyperparameters of the learning rate, optimizer, and training epochs of the original works that our Diff-2-in-1 builds on. In the self-improving stage, the exploitation parameter θE continues the same training scheme as in the warm-up stage, while the creation parameter θC updates once each time θE consumes 40 samples. Thus, the interval of the EMA update for θC depends on the batch size used in the self-improving stage. For the surface normal estimation and semantic segmentation tasks, we adopt a batch size of 4, so the EMA update happens every 10 iterations. For the multi-task frameworks, the batch size is 1, so we perform the EMA update every 40 iterations. The momentum hyperparameter α for the EMA update is set as 0.999 for multi-task learning on PASCAL-Context (Mottaghi et al., 2014), and 0.998 for the rest of the task settings.
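The EMA schedule described above (θC updated from θE once every 40 consumed samples, i.e. every 40 / batch-size iterations, with momentum α) can be sketched as follows. This is a minimal illustration, not the paper's code: the dict-of-floats parameter representation, the function names, and the θC ← α·θC + (1 − α)·θE update direction are assumptions consistent with the standard EMA convention.

```python
def ema_update(theta_c, theta_e, alpha=0.998):
    """One EMA step: theta_C <- alpha * theta_C + (1 - alpha) * theta_E.

    theta_c, theta_e: dicts mapping parameter names to values
    (plain floats here; tensors in a real implementation).
    """
    for k in theta_c:
        theta_c[k] = alpha * theta_c[k] + (1.0 - alpha) * theta_e[k]
    return theta_c

def maybe_ema(step, batch_size, theta_c, theta_e, alpha=0.998):
    """Apply the EMA update once theta_E has consumed 40 samples."""
    interval = 40 // batch_size  # 10 iters at bs=4, 40 iters at bs=1
    if (step + 1) % interval == 0:
        ema_update(theta_c, theta_e, alpha)
    return theta_c
```

With a batch size of 4 the update fires every 10 iterations, and with a batch size of 1 every 40 iterations, matching the intervals stated above.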