Diff-2-in-1: Bridging Generation and Dense Perception with Diffusion Models

Authors: Shuhong Zheng, Zhipeng Bao, Ruoyu Zhao, Martial Hebert, Yu-Xiong Wang

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Comprehensive experimental evaluations validate the effectiveness of our framework, showcasing consistent performance improvements across various discriminative backbones and high-quality multi-modal data generation characterized by both realism and usefulness. Our project website is available at https://zsh2000.github.io/diff-2-in-1.github.io/. 5 EXPERIMENTAL EVALUATION 5.1 EVALUATION SETUP We first evaluate our proposed Diff-2-in-1 in the single-task settings with surface normal estimation and semantic segmentation as targets. Next, we apply Diff-2-in-1 in the multi-task settings of NYUD-MT (Silberman et al., 2012) and PASCAL-Context (Mottaghi et al., 2014) to show that it can provide universal benefit for more tasks simultaneously. Datasets and metrics. We evaluate surface normal estimation on the NYUv2 (Silberman et al., 2012; Ladicky et al., 2014) dataset. Different from previous methods that leverage additional raw data for training, we only use the 795 training samples. We include the number of training samples for each method in Table 1 for reference. Following Bae et al. (2021) and iDisc (Piccinelli et al., 2023), we adopt 11.25°, 22.5°, and 30° thresholds to measure the percentage of pixels with angle error lower than the corresponding thresholds. We also report the mean/median angle error and the root mean square error.
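The surface-normal metrics quoted above (mean/median angle error, RMSE, and the percentage of pixels under the 11.25°, 22.5°, and 30° thresholds) can be sketched as follows. This is an illustrative implementation, not code from the paper; the function name and the flattened `(N, 3)` array layout of per-pixel unit normals are assumptions.

```python
import numpy as np

def normal_metrics(pred, gt, thresholds=(11.25, 22.5, 30.0)):
    """Angular-error metrics for surface normal estimation.

    pred, gt: (N, 3) arrays of unit normal vectors, one per valid pixel.
    Returns mean/median angle error, RMSE (all in degrees), and the
    percentage of pixels whose error is below each angular threshold.
    """
    # Clip dot products to [-1, 1] to avoid NaN from floating-point drift.
    cos = np.clip(np.sum(pred * gt, axis=-1), -1.0, 1.0)
    err = np.degrees(np.arccos(cos))  # per-pixel angular error in degrees
    metrics = {
        "mean": err.mean(),
        "median": np.median(err),
        "rmse": np.sqrt((err ** 2).mean()),
    }
    for t in thresholds:
        metrics[f"<{t}"] = (err < t).mean() * 100.0  # percentage of pixels
    return metrics
```

In practice these metrics are computed only over pixels with valid ground-truth normals, so a validity mask would be applied before flattening.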
Researcher Affiliation Academia University of Illinois Urbana-Champaign; Carnegie Mellon University; Tsinghua University
Pseudocode No The paper describes methods and processes in text and figures but does not contain any explicitly labeled "Pseudocode" or "Algorithm" blocks, nor any structured code-like procedures.
Open Source Code Yes Our project website is available at https://zsh2000.github.io/diff-2-in-1.github.io/.
Open Datasets Yes Datasets and metrics. We evaluate surface normal estimation on the NYUv2 (Silberman et al., 2012; Ladicky et al., 2014) dataset. Semantic segmentation. We instantiate our Diff-2-in-1 on VPD (Zhao et al., 2023), a diffusion-based segmentation model. For self-improving, we synthesize one sample for each image in the training set. Multi-task evaluations. We apply our Diff-2-in-1 on two state-of-the-art multi-task methods, InvPT (Ye & Xu, 2022) and TaskPrompter (Ye & Xu, 2023). A total of 500 synthetic samples are generated for NYUD-MT following the surface normal evaluation. For PASCAL-Context, one sample is synthesized for each image in the training set with our Diff-2-in-1. The comparisons on NYUD-MT and PASCAL-Context are shown in Table 3 and Table 4, respectively. We train both the baseline Bae et al. (2021) and our Diff-2-in-1 on the ScanNet (Dai et al., 2017) dataset for the surface normal estimation task, and evaluate the performance on the test set of NYUv2 (Silberman et al., 2012; Ladicky et al., 2014).
Dataset Splits Yes We evaluate surface normal estimation on the NYUv2 (Silberman et al., 2012; Ladicky et al., 2014) dataset. Different from previous methods that leverage additional raw data for training, we only use the 795 training samples. We include the number of training samples for each method in Table 1 for reference. For multi-task evaluations, NYUD-MT spans across three tasks including semantic segmentation, monocular depth estimation, and surface normal estimation; PASCAL-Context takes semantic segmentation, human parsing, saliency detection, and surface normal estimation for evaluation. We run this ablation for semantic segmentation on the ADE20K dataset: we randomly select 10% (2K) to 90% (18K) samples with 10% (2K) intervals in between, assuming that Diff-2-in-1 only gets access to partial data.
Hardware Specification Yes This work used computational resources, including the NCSA Delta and Delta AI supercomputers through allocations CIS220014 and CIS230012 from the Advanced Cyberinfrastructure Coordination Ecosystem: Services & Support (ACCESS) program, as well as the TACC Frontera supercomputer and Amazon Web Services (AWS) through the National Artificial Intelligence Research Resource (NAIRR) Pilot.
Software Dependencies No The paper mentions several models and frameworks like "latent diffusion model (LDM) (Rombach et al., 2022)", "stable diffusion model (Rombach et al., 2022)", "U-Net (Ronneberger et al., 2015)", "Stable Diffusion v1-5", and "BLIP-2 (Li et al., 2023b)". However, it does not provide specific version numbers for ancillary software libraries or programming languages (e.g., Python, PyTorch, CUDA) required to replicate the experiments.
Experiment Setup Yes In the warm-up stage, we follow the same hyperparameters of the learning rate, optimizer, and training epochs of the original works that our Diff-2-in-1 builds on. In the self-improving stage, the exploitation parameter θE continues the same training scheme as in the warm-up stage, while the creation parameter θC updates once each time θE consumes 40 samples. Thus, the interval of the EMA update for θC depends on the batch size used in the self-improving stage. For the surface normal estimation and semantic segmentation tasks, we adopt a batch size of 4, so the EMA update happens every 10 iterations. For the multi-task frameworks, the batch size is 1, so we perform the EMA update every 40 iterations. The momentum hyperparameter α for the EMA update is set as 0.999 for multi-task learning on PASCAL-Context (Mottaghi et al., 2014), and 0.998 for the rest of the task settings.
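The EMA schedule described above (θC updated from θE once every 40 consumed samples, i.e. every 40 / batch-size iterations, with momentum α) can be sketched as follows. This is a minimal illustration, not the paper's code: the dict-of-floats parameter representation, the function names, and the θC ← α·θC + (1 − α)·θE update direction are assumptions consistent with the standard EMA convention.

```python
def ema_update(theta_c, theta_e, alpha=0.998):
    """One EMA step: theta_C <- alpha * theta_C + (1 - alpha) * theta_E.

    theta_c, theta_e: dicts mapping parameter names to values
    (plain floats here; tensors in a real implementation).
    """
    for k in theta_c:
        theta_c[k] = alpha * theta_c[k] + (1.0 - alpha) * theta_e[k]
    return theta_c

def maybe_ema(step, batch_size, theta_c, theta_e, alpha=0.998):
    """Apply the EMA update once theta_E has consumed 40 samples."""
    interval = 40 // batch_size  # 10 iters at bs=4, 40 iters at bs=1
    if (step + 1) % interval == 0:
        ema_update(theta_c, theta_e, alpha)
    return theta_c
```

With a batch size of 4 the update fires every 10 iterations, and with a batch size of 1 every 40 iterations, matching the intervals stated above.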