What Matters When Repurposing Diffusion Models for General Dense Perception Tasks?

Authors: Guangkai Xu, Yongtao Ge, Mingyu Liu, Chengxiang Fan, Kangyang Xie, Zhiyue Zhao, Hao Chen, Chunhua Shen

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Comprehensive experiments on a diverse set of dense visual perceptual tasks, including monocular depth estimation, surface normal estimation, image segmentation, and matting, are performed to demonstrate the remarkable adaptability and effectiveness of our proposed method.
Researcher Affiliation | Collaboration | 1. Zhejiang University, China; 2. Ant Group
Pseudocode | No | The paper describes its methods and processes in paragraph text and figures (Figures 1 and 2), but does not contain any clearly labeled "Pseudocode" or "Algorithm" blocks, nor structured code-like procedures.
Open Source Code | Yes | Code: https://github.com/aim-uofa/GenPercept
Open Datasets | Yes | The evaluation is performed on five zero-shot datasets including KITTI (Geiger et al., 2013), NYU (Silberman et al., 2012), ScanNet (Dai et al., 2017), DIODE (Vasiljevic et al., 2019), and ETH3D (Schops et al., 2017). We choose DIS5K (Qin et al., 2022) as the training and testing dataset. For training, we utilized the indoor synthetic dataset Hypersim (Roberts et al., 2021), which comprises 40 semantic segmentation class labels. We test the model's performance on Hypersim (Roberts et al., 2021) and zero-shot ability on a subset of the ADE20K (Zhou et al., 2017) validation set, which contains overlapping classes.
Dataset Splits | Yes | We utilize DIS-TR for training and evaluate our model on DIS-VD and DIS-TE subsets. ... We test the model's performance on Hypersim (Roberts et al., 2021) and zero-shot ability on a subset of the ADE20K (Zhou et al., 2017) validation set, which contains overlapping classes.
Hardware Specification | No | The paper does not provide specific hardware details, such as GPU models, CPU models, or memory specifications, used for the experiments.
Software Dependencies | No | The paper mentions fine-tuning Stable Diffusion v2.1 and using a U-Net, but it does not specify programming languages, libraries, or other software components with their version numbers.
Experiment Setup | Yes | Unless specified otherwise, we freeze the VAE autoencoder and fine-tune the U-Net of Stable Diffusion v2.1 to estimate the ground-truth label latent for 30,000 iterations, with a resolution of (768, 768), a batch size of 32, and a learning rate of 3e-5.