Class Distribution-induced Attention Map for Open-vocabulary Semantic Segmentations

Authors: Dong Un Kang, Hayeon Kim, Se Young Chun

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate our CDAM method on three widely used benchmark datasets that include a background class, separate from the foreground classes: PASCAL VOC (Everingham et al., 2010), PASCAL Context (Mottaghi et al., 2014), and COCO-Object (Lin et al., 2014), with 20, 59, and 80 foreground classes and validation sets of 1449, 5105, and 5000 images, respectively. We also use three additional benchmark datasets that do not include a background class: Cityscapes (Cordts et al., 2016), ADE20K (Zhou et al., 2017), and COCO-Stuff (Lin et al., 2014), with 19, 150, and 171 classes, respectively.
Researcher Affiliation | Academia | Dong Un Kang1, Hayeon Kim1, Se Young Chun1,2; 1Department of ECE, 2INMC & IPAI, Seoul National University
Pseudocode | No | The paper describes the methodology using textual explanations, mathematical formulations, and diagrams (e.g., Figure 1 for the overall pipeline). However, it does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks with structured, code-like steps.
Open Source Code | Yes | Code is available at https://janeyeon.github.io/cdamclip.
Open Datasets | Yes | We evaluate our CDAM method on three widely used benchmark datasets that include a background class: PASCAL VOC (Everingham et al., 2010), PASCAL Context (Mottaghi et al., 2014), and COCO-Object (Lin et al., 2014). We also use three additional benchmark datasets that do not include a background class: Cityscapes (Cordts et al., 2016), ADE20K (Zhou et al., 2017), and COCO-Stuff (Lin et al., 2014).
Dataset Splits | Yes | The validation sets contain 1449, 5105, and 5000 images, respectively. We follow the unified evaluation protocol of TCL (Cha et al., 2023) for open-vocabulary semantic segmentation, which ensures no access to target data before evaluation.
Hardware Specification | Yes | All measurements were performed on an NVIDIA A100 GPU.
Software Dependencies | No | The paper mentions using the "CLIP ViT-B/16 model from OpenCLIP (Radford et al., 2021)" but does not specify software dependencies such as programming language versions (e.g., Python 3.x) or library versions (e.g., PyTorch 1.x).
Experiment Setup | Yes | The input image is resized to 224 x 224 pixels, and the patch size is set to 16 x 16 pixels. Following the experimental settings of GroupViT (Xu et al., 2022a), we resize input images to have the shorter side of 448 pixels and employ the mean Intersection-over-Union (mIoU) metric, which is generally used for evaluating semantic segmentation performance. The temperature τ and the modulation of entropy α are set to 0.1 and 2.5, respectively. The set of scaling factors M is {0.25, 0.37, 0.5, 0.63, 0.75, 0.87, 1.0}.
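The hyperparameters quoted in the setup row can be sanity-checked with a minimal sketch. This is not the authors' code: it only assumes the common pattern of a temperature-scaled softmax over patch-text similarities to form per-patch class distributions, with an entropy-based weighting controlled by α; the function names `class_distribution` and `entropy_weight` are illustrative.

```python
import numpy as np

# Hyperparameters quoted in the paper's setup.
TAU = 0.1        # softmax temperature
ALPHA = 2.5      # entropy-modulation exponent
SCALES = [0.25, 0.37, 0.5, 0.63, 0.75, 0.87, 1.0]  # multi-scale factors M
INPUT_SIZE = 224
PATCH = 16
NUM_PATCHES = (INPUT_SIZE // PATCH) ** 2  # 14 x 14 = 196 patch tokens

def class_distribution(similarity, tau=TAU):
    """Temperature-scaled softmax over class logits for each patch.

    similarity: (num_patches, num_classes) patch-text similarity scores.
    A low tau (0.1 here) sharpens each patch's class distribution.
    """
    z = similarity / tau
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def entropy_weight(dist, alpha=ALPHA):
    """Illustrative confidence weight per patch (an assumption, not the
    paper's exact formula): confident low-entropy patches get weights
    near 1; alpha controls how aggressively high entropy is penalized."""
    eps = 1e-12
    h = -(dist * np.log(dist + eps)).sum(axis=-1)
    h_norm = h / np.log(dist.shape[-1])  # normalize entropy to [0, 1]
    return (1.0 - h_norm) ** alpha

# Usage with random similarities for 21 classes (VOC's 20 + background).
rng = np.random.default_rng(0)
sim = rng.normal(size=(NUM_PATCHES, 21))
dist = class_distribution(sim)   # (196, 21), rows sum to 1
w = entropy_weight(dist)         # (196,), each weight in [0, 1]
```

At inference, such per-patch maps would typically be computed at each scale in `SCALES` and averaged, which is consistent with the multi-scale factor set the setup row reports.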