A Simple Framework for Open-Vocabulary Zero-Shot Segmentation

Authors: Thomas Stegmüller, Tim Lebailly, Nikola Đukić, Behzad Bozorgtabar, Tinne Tuytelaars, Jean-Philippe Thiran

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | 4 EXPERIMENTS In this section, we investigate the properties of SimZSS through various experiments. Additional experiments can be found in Appendix A. 4.1 EXPERIMENTAL SETUP 4.2 ZERO-SHOT SEGMENTATION OF FOREGROUND 4.3 ZERO-SHOT SEGMENTATION 4.4 ZERO-SHOT CLASSIFICATION
Researcher Affiliation | Academia | Thomas Stegmüller1 Tim Lebailly2 Nikola Đukić2 Behzad Bozorgtabar1,3 Tinne Tuytelaars2 Jean-Philippe Thiran1,3 1EPFL 2KU Leuven 3CHUV 1{firstname}.{lastname}@epfl.ch 2{firstname}.{lastname}@esat.kuleuven.be
Pseudocode | No | The paper describes its methodology through textual explanations and mathematical equations (1) to (10), but it contains no clearly labeled 'Pseudocode' or 'Algorithm' block, nor structured, code-like steps for its procedures.
Open Source Code | Yes | Our code and pretrained models are publicly available at https://github.com/tileb1/simzss.
Open Datasets | Yes | 4.1.1 PRETRAINING DATASETS We train our models on two distinct datasets: COCO Captions (Lin et al., 2014; Chen et al., 2015) and LAION-400M (Schuhmann et al., 2021). ... We report the mIoU scores across five standard datasets, namely Pascal VOC (Everingham et al., 2012), Pascal Context (Mottaghi et al., 2014), COCO-Stuff (Caesar et al., 2018), Cityscapes (Cordts et al., 2016) and ADE20K (Zhou et al., 2017). ... All datasets used in our work are publicly available.
Dataset Splits | Yes | We follow the MMSegmentation (Contributors, 2020) implementation of Cha et al. (2023). ... Pascal VOC 2012 The Pascal VOC dataset (Everingham et al., 2010) contains 20 classes with semantic segmentation annotations. The training set consists of 1,464 images, while the validation set includes 1,449 images. ... Pascal Context ... The training set contains approximately 4,998 images, while the validation set includes around 5,105 images. ... COCO-Stuff ... It includes over 164K images for training and 20K images for validation... ADE20K ... The training set includes 20,210 images, and the validation set consists of 2,000 images. Cityscapes ... The training set includes 2,975 images, and the validation set includes 500 images...
Hardware Specification | Yes | Experiments are conducted on a single node with 4x AMD MI250x GPUs (2 compute dies per GPU, i.e., worldsize = 8) with a memory usage of 38GB per compute die.
Software Dependencies | No | We use the en_core_web_trf model from SpaCy (Honnibal et al., 2020) as part-of-speech tagger to identify noun phrases. ... We follow the MMSegmentation (Contributors, 2020) implementation of Cha et al. (2023). The paper names specific software components, including SpaCy (with the specific 'en_core_web_trf' model) and MMSegmentation, but provides no explicit version numbers for these dependencies as required for reproducibility.
Experiment Setup | Yes | For COCO Captions, we conduct training over 4M processed samples (≈6.6 epochs) using a global batchsize of 16,384. We incorporate a warm-up strategy spanning 10% of the training steps, linearly ramping up the learning rate until it reaches its peak value, chosen from the set {8e-5, 3e-5, 8e-6, 3e-6}. Subsequently, we employ a cosine decay schedule for the remaining steps. Similarly, for LAION-400M, we train for 1 epoch with a global batchsize of 32,768, and we set the learning rate from the options {3e-5, 8e-6, 3e-6, 8e-7, 3e-7}. ... The overall objective of SimZSS, denoted as Ltot, is a weighted sum of the global and local consistency objectives: Ltot = Lg + λLl (10) where λ is a weighting parameter whose impact is ablated in Table 9. ... where τ is a temperature parameter that regulates the sharpness of the similarity distribution (see Tab. 9). ... After a grid search on λ, τ, and the learning rate, we find that the best-performing setting for training on COCO Captions is (λ = 0.05, τ = 0.1).
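The reported setup combines a linear warm-up with cosine decay for the learning rate, and a weighted sum of global and local losses (Eq. 10). The following is a minimal plain-Python sketch of both pieces; the function names and the toy step counts are illustrative assumptions, not taken from the authors' released code.

```python
import math

def lr_at_step(step, total_steps, peak_lr, warmup_frac=0.1):
    """Learning rate at a given step: linear warm-up to peak_lr over the
    first warmup_frac of training, then cosine decay toward zero."""
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        # linear ramp-up to the peak value
        return peak_lr * (step + 1) / warmup_steps
    # cosine decay over the remaining steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

def total_loss(l_global, l_local, lam=0.05):
    """L_tot = L_g + lambda * L_l (Eq. 10); lambda = 0.05 is the
    best-performing COCO Captions setting reported in the paper."""
    return l_global + lam * l_local
```

With a peak learning rate of 8e-5 (one of the grid-searched options), the schedule reaches the peak at the end of the 10% warm-up window and sits at half the peak at the midpoint of the decay phase.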