Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
SeeDiff: Off-the-Shelf Seeded Mask Generation from Diffusion Models
Authors: Joon Hyun Park, Kumju Jo, Sungyong Baik
AAAI 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results demonstrate outstanding performance of segmentation networks trained with our generated image-mask pairs. The results underline the effectiveness of our proposed approach in generating high-quality fine-grained pixel-level annotations, without the need for a pre-trained segmentation network, text prompt tuning, training a new module, or learning procedures. Experimental Settings. Datasets: Following the settings of the previous work DiffuMask (Wu et al. 2023b), we evaluated our model on the following two datasets: Pascal-VOC2012 (Everingham et al. 2010) and Cityscapes (Cordts et al. 2016). Table 1: Semantic segmentation results on VOC 2012 val. Ablation Study: In this section, extensive ablation studies are conducted to assess the effectiveness of each proposed module. |
| Researcher Affiliation | Academia | Joon Hyun Park¹, Kumju Jo¹, Sungyong Baik¹·² — ¹Dept. of Artificial Intelligence, Hanyang University, South Korea; ²Dept. of Data Science, Hanyang University, South Korea |
| Pseudocode | No | The paper describes the method using textual explanations, mathematical equations (e.g., Eq. 1-12), and figures (e.g., Figure 2: Overall framework), but does not contain any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code: https://github.com/BAIKLAB/SeeDiff.git |
| Open Datasets | Yes | Datasets: Following the settings of the previous work DiffuMask (Wu et al. 2023b), we evaluated our model on the following two datasets: Pascal-VOC2012 (Everingham et al. 2010) and Cityscapes (Cordts et al. 2016). |
| Dataset Splits | Yes | In the Pascal VOC-2012 (Everingham et al. 2010) setting, we generate 2k and 3k images per class, using a total number of images (40.0k and 60.0k) identical to those used in previous studies such as Dataset Diffusion (Nguyen et al. 2023) and DiffuMask (Wu et al. 2023b). Table 1 shows the results of semantic segmentation on the VOC 2012 dataset. Table 2: Module ablations, performed on VOC 2012 val using Mask2Former with Swin-B. Table 3: Results of semantic segmentation on Cityscapes val. |
| Hardware Specification | Yes | All experiments, including image generation and evaluation, were conducted on an NVIDIA RTX 4090 GPU. |
| Software Dependencies | No | The paper mentions using "Stable Diffusion 2-base version" and evaluating with "Mask2Former (Cheng et al. 2022)". However, it does not provide specific version numbers for underlying programming languages (e.g., Python), libraries (e.g., PyTorch, TensorFlow), or other software tools that would be needed for full reproducibility. |
| Experiment Setup | Yes | We utilize the Stable Diffusion 2-base version to generate images with T = 50 denoising timesteps. We use α = 0.5 as the threshold parameter to extract the seeds and β = 0.3 as the threshold parameter to discretize a soft mask into the final mask. The settings required for training and evaluating Mask2Former, including initialization, data augmentation, batch size, weight decay, and learning rate, follow the original paper. |
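To make the role of the two thresholds in the Experiment Setup row concrete, the sketch below illustrates how α = 0.5 and β = 0.3 could be applied: α to pick high-attention seed pixels from a normalized attention map, and β to binarize a soft mask into a final 0/1 mask. This is only an illustration of the thresholding step under assumed inputs; the function names (`extract_seeds`, `discretize_mask`) and the use of a raw NumPy array as the attention map are hypothetical and not taken from the SeeDiff implementation, whose full pipeline is more involved.

```python
import numpy as np

def extract_seeds(attn_map, alpha=0.5):
    # Hypothetical helper: min-max normalize the attention map,
    # then keep pixel coordinates whose value exceeds alpha.
    attn = (attn_map - attn_map.min()) / (attn_map.max() - attn_map.min() + 1e-8)
    return np.argwhere(attn > alpha)  # array of (row, col) seed coordinates

def discretize_mask(soft_mask, beta=0.3):
    # Hypothetical helper: binarize a soft mask into the final 0/1 mask.
    return (soft_mask > beta).astype(np.uint8)

# Toy 4x4 "attention map" standing in for a real cross-attention output.
attn = np.array([[0.9, 0.8, 0.1, 0.0],
                 [0.7, 0.6, 0.2, 0.1],
                 [0.2, 0.1, 0.0, 0.0],
                 [0.1, 0.0, 0.0, 0.0]])
seeds = extract_seeds(attn, alpha=0.5)      # top-left high-attention pixels
mask = discretize_mask(attn, beta=0.3)      # binary mask over the same grid
```

Because α is applied after normalization while β cuts the soft mask directly, the two thresholds play different roles even when, as in this toy case, they select the same region.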