What Matters When Repurposing Diffusion Models for General Dense Perception Tasks?
Authors: Guangkai Xu, Yongtao Ge, Mingyu Liu, Chengxiang Fan, Kangyang Xie, Zhiyue Zhao, Hao Chen, Chunhua Shen
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Comprehensive experiments on a diverse set of dense visual perceptual tasks, including monocular depth estimation, surface normal estimation, image segmentation, and matting, are performed to demonstrate the remarkable adaptability and effectiveness of our proposed method. |
| Researcher Affiliation | Collaboration | 1 Zhejiang University, China 2 Ant Group |
| Pseudocode | No | The paper describes methods and processes in paragraph text and figures (Figure 1 and 2), but does not contain any clearly labeled 'Pseudocode' or 'Algorithm' blocks or structured code-like procedures. |
| Open Source Code | Yes | Code: https://github.com/aim-uofa/GenPercept |
| Open Datasets | Yes | The evaluation is performed on five zero-shot datasets including KITTI (Geiger et al., 2013), NYU (Silberman et al., 2012), ScanNet (Dai et al., 2017), DIODE (Vasiljevic et al., 2019), and ETH3D (Schops et al., 2017). We choose DIS5K (Qin et al., 2022) as the training and testing dataset. For training, we utilized the indoor synthetic dataset, HyperSim (Roberts et al., 2021), which comprises 40 semantic segmentation class labels. We test the model's performance on HyperSim (Roberts et al., 2021) and zero-shot ability on a subset of the ADE20k (Zhou et al., 2017) validation set, which contains overlapping classes. |
| Dataset Splits | Yes | We utilize DIS-TR for training and evaluate our model on DIS-VD and DIS-TE subsets. ... We test the model's performance on HyperSim (Roberts et al., 2021) and zero-shot ability on a subset of the ADE20k (Zhou et al., 2017) validation set, which contains overlapping classes. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU models, or memory specifications used for the experiments. |
| Software Dependencies | No | The paper mentions fine-tuning 'Stable Diffusion v2.1' and using a 'U-Net', but it does not specify programming languages, libraries, or other software components with their version numbers. |
| Experiment Setup | Yes | Unless specified otherwise, we freeze the VAE autoencoder and fine-tune the U-Net of Stable Diffusion v2.1 to estimate the ground-truth label latent for 30000 iterations, with a resolution of (768, 768), a batch size of 32, and a learning rate of 3e-5. |
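The reported setup (frozen VAE, trainable U-Net, 30000 iterations, 768×768, batch size 32, lr 3e-5) can be sketched as a minimal PyTorch training configuration. This is an illustrative sketch only: the tiny `nn.Conv2d` modules are hypothetical stand-ins for Stable Diffusion v2.1's VAE and U-Net, not the paper's actual models or training loop.

```python
# Hedged sketch of the fine-tuning recipe described in the paper:
# freeze the VAE, optimize only the U-Net. The two Conv2d layers below
# are placeholders standing in for the real Stable Diffusion v2.1 modules.
import torch
import torch.nn as nn

CONFIG = {
    "iterations": 30_000,
    "resolution": (768, 768),
    "batch_size": 32,
    "learning_rate": 3e-5,
}

vae = nn.Conv2d(3, 4, kernel_size=1)            # placeholder: frozen VAE
unet = nn.Conv2d(4, 4, kernel_size=3, padding=1)  # placeholder: trainable U-Net

# Freeze the VAE so no gradients are computed for it.
for p in vae.parameters():
    p.requires_grad_(False)

# Only U-Net parameters are handed to the optimizer.
optimizer = torch.optim.AdamW(unet.parameters(), lr=CONFIG["learning_rate"])

num_trainable = sum(p.numel() for p in unet.parameters() if p.requires_grad)
num_frozen_grads = sum(p.numel() for p in vae.parameters() if p.requires_grad)
print(num_trainable, num_frozen_grads)
```

The key design point the row documents is parameter selectivity: the optimizer sees only the U-Net, so the VAE's latent space is preserved exactly as pretrained.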