Class Distribution-induced Attention Map for Open-vocabulary Semantic Segmentations
Authors: Dong Un Kang, Hayeon Kim, Se Young Chun
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our CDAM method on three widely used benchmark datasets that include a background class, separate from the foreground classes: PASCAL VOC (Everingham et al., 2010), PASCAL Context (Mottaghi et al., 2014), and COCO-Object (Lin et al., 2014), with 20, 59, and 80 foreground classes, respectively. Their validation sets contain 1449, 5105, and 5000 images, respectively. We also use three additional benchmark datasets that do not include a background class: Cityscapes (Cordts et al., 2016), ADE20K (Zhou et al., 2017), and COCO-Stuff (Lin et al., 2014), which have 19, 150, and 171 classes, respectively. |
| Researcher Affiliation | Academia | Dong Un Kang1, Hayeon Kim1, Se Young Chun1,2; 1Department of ECE, 2INMC & IPAI, Seoul National University |
| Pseudocode | No | The paper describes the methodology using textual explanations, mathematical formulations, and diagrams (e.g., Figure 1 for the overall pipeline). However, it does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks with structured, code-like steps. |
| Open Source Code | Yes | Code is available at https://janeyeon.github.io/cdamclip. |
| Open Datasets | Yes | We evaluate our CDAM method on three widely used benchmark datasets that include a background class: PASCAL VOC (Everingham et al., 2010), PASCAL Context (Mottaghi et al., 2014), and COCO-Object (Lin et al., 2014). We also use three additional benchmark datasets that do not include a background class: Cityscapes (Cordts et al., 2016), ADE20K (Zhou et al., 2017), and COCO-Stuff (Lin et al., 2014). |
| Dataset Splits | Yes | The validation sets contain 1449, 5105, and 5000 images, respectively. We follow the unified evaluation protocol by TCL (Cha et al., 2023) in open-vocabulary semantic segmentation. This protocol ensures no access to target data before evaluation. |
| Hardware Specification | Yes | All measurements were performed on an NVIDIA A100 GPU. |
| Software Dependencies | No | The paper mentions using the "CLIP ViT-B/16 model from OpenCLIP (Radford et al., 2021)" but does not specify software dependencies with version numbers, such as the programming language version (e.g., Python 3.x) or library versions (e.g., PyTorch 1.x). |
| Experiment Setup | Yes | The input image is resized to 224 x 224 pixels, and the patch size is set to 16 x 16 pixels. Following the experimental settings of GroupViT (Xu et al., 2022a), we resize input images to have a shorter side of 448 pixels and employ the mean Intersection-over-Union (mIoU) metric, which is generally used for evaluating semantic segmentation performance. The temperature τ and the modulation of entropy α are set to 0.1 and 2.5, respectively. The set of scaling factors M is {0.25, 0.37, 0.5, 0.63, 0.75, 0.87, 1.0}. |
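The hyperparameters reported in the Experiment Setup row can be collected into a single evaluation config for reference. This is a minimal sketch: all key names (`input_size`, `scales`, etc.) are illustrative choices, not names from the authors' released code, which may organize these settings differently.

```python
# Hedged sketch of the reported CDAM evaluation settings as a config dict.
# Key names are assumptions for illustration; values come from the paper's
# "Experiment Setup" evidence above.
CDAM_EVAL_CONFIG = {
    "input_size": (224, 224),      # input image resized to 224 x 224 pixels
    "patch_size": 16,              # ViT patch size: 16 x 16 pixels
    "shorter_side": 448,           # GroupViT protocol: shorter side = 448 px
    "temperature": 0.1,            # tau
    "entropy_modulation": 2.5,     # alpha
    "scales": [0.25, 0.37, 0.5, 0.63, 0.75, 0.87, 1.0],  # scaling-factor set M
    "metric": "mIoU",
}

# Derived sanity check: number of patch tokens per image at base resolution.
tokens_per_side = CDAM_EVAL_CONFIG["input_size"][0] // CDAM_EVAL_CONFIG["patch_size"]
num_patch_tokens = tokens_per_side ** 2  # 14 * 14 = 196 tokens for ViT-B/16
```

With a 224 x 224 input and 16 x 16 patches, a ViT-B/16 backbone produces a 14 x 14 grid of patch tokens, which is the spatial resolution at which attention maps are computed before upsampling to the image size.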