DiMSOD: A Diffusion-Based Framework for Multi-Modal Salient Object Detection

Authors: Shuo Zhang, Jiaming Huang, Wenbing Tang, Yan Wu, Terrence Hu, Xiaogang Xu, Jing Liu

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments demonstrate that DiMSOD efficiently detects salient objects across RGB, RGB-D, and RGB-T datasets, achieving superior performance compared to previous well-established methods.
Researcher Affiliation | Collaboration | 1) Shanghai Key Laboratory of Trustworthy Computing, East China Normal University; 2) Technology Center, Huolala; 3) College of Computing and Data Science, Nanyang Technological University; 4) The Chinese University of Hong Kong.
Pseudocode | No | The paper describes the network architecture and methods in detail but does not contain pseudocode or an algorithm block.
Open Source Code | No | The code for evaluating the model is derived from F3Net. (This refers to third-party evaluation code, not the source code of the proposed DiMSOD model itself; there is no explicit statement about releasing their own code.)
Open Datasets | Yes | DiMSOD is trained jointly on three different types of SOD datasets. Following recent work, the training set consists of the following subsets, resized to 512×512: the RGB dataset DUTS-TR (Wang et al. 2017) with 10,553 images; the RGB-T dataset VT5000 (Tu et al. 2022b) with 2,500 images; and the RGB-D datasets NJUD (Ju et al. 2014) with 1,485 images, NLPR (Peng et al. 2014) with 700 images, and DUTLF-Depth (Piao et al. 2019) with 800 images. Stable Diffusion is used as the backbone when implementing DiMSOD.
Dataset Splits | Yes | DiMSOD is trained jointly on three different types of SOD datasets. Following recent work, the training set consists of the following subsets, resized to 512×512: the RGB dataset DUTS-TR (Wang et al. 2017) with 10,553 images; the RGB-T dataset VT5000 (Tu et al. 2022b) with 2,500 images; and the RGB-D datasets NJUD (Ju et al. 2014) with 1,485 images, NLPR (Peng et al. 2014) with 700 images, and DUTLF-Depth (Piao et al. 2019) with 800 images. For RGB, DiMSOD is evaluated on 5 widely used benchmark datasets not seen during training: DUT-OMRON (5,168 images), ECSSD (1,000 images), PASCAL-S (850 images), HKU-IS (4,447 images), and DUTS-TE (5,019 images). For RGB-D, the test sets of DUTLF-Depth (400 images), NJUD (500 images), NLPR (300 images), SIP (929 images), and LFSD (100 images) are used. For RGB-T, the test sets of VT5000 (2,500 images), VT821 (821 images), and VT1000 (1,000 images) are used.
Hardware Specification | Yes | Training takes 100 epochs with a batch size of 32 on 4 Nvidia A100 GPUs.
Software Dependencies | No | Stable Diffusion is used as the backbone when implementing DiMSOD. The initial pre-training configurations with a v-objective (Salimans and Ho 2022) are followed in the experiments. In training, the DDPM noise scheduler (Ho, Jain, and Abbeel 2020b) is used with 1,000 diffusion steps; for inference, the DDIM scheduler is used with 20 sampling steps. (No specific version numbers for libraries such as PyTorch or CUDA, or the exact Stable Diffusion release, are provided.)
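The step counts reported above imply a standard DDIM-style subsampling of the 1,000-step DDPM training schedule. A minimal sketch of that timestep selection (illustrative only; the function name and even-spacing strategy are assumptions, not the authors' code):

```python
def ddim_timesteps(train_steps: int = 1000, sample_steps: int = 20) -> list[int]:
    """Pick an evenly spaced, descending subset of the DDPM training
    timesteps, as DDIM does when sampling with fewer steps than training."""
    stride = train_steps // sample_steps  # 1000 // 20 = 50
    return list(range(train_steps - 1, -1, -stride))[:sample_steps]

# 20 timesteps from 999 down to 49, stepping by 50
steps = ddim_timesteps()
```

In practice this selection is handled internally by the scheduler implementation (e.g. a DDIM scheduler's inference-step configuration); the sketch only shows the arithmetic behind "1,000 training steps, 20 sampling steps".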
Experiment Setup | Yes | In training, the DDPM noise scheduler (Ho, Jain, and Abbeel 2020b) is used with 1,000 diffusion steps; for inference, the DDIM scheduler is used with 20 sampling steps. The final prediction combines outcomes from 10 inference iterations initialized with diverse initial noise. Training takes 100 epochs with a batch size of 32 on 4 Nvidia A100 GPUs, using the Adam optimizer with a learning rate of 3×10⁻⁵. Data augmentation consists of random horizontal and vertical flips.
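The hyper-parameters in this row can be collected into a small configuration sketch, together with one plausible reading of "combine outcomes from 10 inference iterations" as an element-wise average. All names here are illustrative assumptions; this is not the authors' released code:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    # Values reported in the paper's experiment setup
    epochs: int = 100
    batch_size: int = 32
    num_gpus: int = 4                  # Nvidia A100
    learning_rate: float = 3e-5        # Adam optimizer
    train_diffusion_steps: int = 1000  # DDPM noise scheduler
    inference_steps: int = 20          # DDIM sampling
    ensemble_runs: int = 10            # runs with diverse initial noise
    input_size: tuple = (512, 512)
    augmentations: tuple = ("random_hflip", "random_vflip")


def combine_predictions(runs: list) -> list:
    """Element-wise mean of flattened saliency maps from several inference
    runs; one simple way to combine the 10 noise-initialized predictions."""
    n = len(runs)
    return [sum(vals) / n for vals in zip(*runs)]
```

How exactly the 10 outcomes are combined (mean, vote, or otherwise) is not specified in the quoted text, so the averaging function above is only a placeholder for that step.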