Boosting Segment Anything Model Towards Open-Vocabulary Learning

Authors: Xumeng Han, Longhui Wei, Xuehui Yu, Zhiyang Dou, Xin He, Kuiran Wang, Yingfei Sun, Zhenjun Han, Qi Tian

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We follow the GLIP (Li et al. 2022a) protocol and conduct experiments to comprehensively evaluate the effectiveness of Sambor in open-vocabulary object detection. Benefiting from the effective designs, Sambor demonstrates superior open-vocabulary detection performance on COCO (Lin et al. 2014) and LVIS (Gupta, Dollar, and Girshick 2019) benchmarks.
Researcher Affiliation | Collaboration | ¹University of Chinese Academy of Sciences, ²Huawei Inc. EMAIL, EMAIL, EMAIL, EMAIL
Pseudocode | No | The paper describes the methodology in text and illustrates it with architectural diagrams (Figure 1 and Figure 2), but it does not contain any structured pseudocode or algorithm blocks.
Open Source Code | No | The paper states that the "MMDetection (Chen et al. 2019) code-base is used," but does not provide any explicit statement about releasing the code for the methodology described in this paper, nor a link to a code repository.
Open Datasets | Yes | For object detection, we use the Objects365 (Shao et al. 2019) dataset (referred to as O365), comprising 365 categories. For phrase grounding, we use the GoldG (Kamath et al. 2021) dataset... COCO Benchmark (Lin et al. 2014)... LVIS Benchmark (Gupta, Dollar, and Girshick 2019)
Dataset Splits | Yes | COCO Benchmark (Lin et al. 2014), comprising 80 common object categories... LVIS Benchmark (Gupta, Dollar, and Girshick 2019) contains 1,203 categories... We report the Fixed AP (Dave et al. 2021) on both the Mini Val (Kamath et al. 2021) subset, comprising 5,000 images, and the complete validation set v1.0... fine-tune for 1 epoch on approximately one-fifth of the O365 dataset.
Hardware Specification | No | The paper does not provide any specific hardware details such as GPU models, CPU types, or memory used for conducting the experiments.
Software Dependencies | No | The paper mentions using the "MMDetection (Chen et al. 2019) code-base," several models such as "SAM with ViT-B (Dosovitskiy et al. 2020)" and "CLIP with RN50x64 (He et al. 2016)," and the "AdamW (Loshchilov and Hutter 2019)" optimizer. However, it does not specify version numbers for MMDetection or any other software libraries or frameworks used.
Experiment Setup | Yes | We pre-train our models using SAM with ViT-B (Dosovitskiy et al. 2020) as the backbone and CLIP with RN50x64 (He et al. 2016), using a batch size of 64. We select the AdamW (Loshchilov and Hutter 2019) optimizer with a 0.05 weight decay, an initial learning rate of 4×10⁻⁴, and cosine annealing learning rate decay. The default training schedule is 12 epochs. The input image size is 1,024×1,024 with standard scale jittering (Ghiasi et al. 2021). ... we use a 32×32 grid of points to fine-tune for 1 epoch on approximately one-fifth of the O365 dataset. Maintaining all other hyper-parameters constant, employing a reduced learning rate of 4×10⁻⁵ contributes to the efficacy of fine-tuning.
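Since the paper releases no code, the quoted schedule can be illustrated with a minimal, self-contained sketch of the cosine annealing decay it describes. The function name, the per-epoch granularity, and the assumption that the rate decays to zero over the 12-epoch run are ours, not from the paper; the hyperparameters (base rate 4×10⁻⁴, 12 epochs) are from the quoted setup.

```python
import math

def cosine_annealed_lr(epoch, total_epochs=12, base_lr=4e-4, min_lr=0.0):
    """Cosine annealing (Loshchilov & Hutter): decay base_lr to min_lr
    over total_epochs following half a cosine period.

    NOTE: illustrative sketch only; the paper's actual implementation
    (MMDetection) may step the schedule per iteration, not per epoch.
    """
    progress = epoch / total_epochs
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))

# Learning rate at the start of each of the 12 epochs, plus the final value.
schedule = [cosine_annealed_lr(e) for e in range(13)]
```

The schedule starts at exactly 4×10⁻⁴, passes through half the base rate at the midpoint (epoch 6), and reaches the minimum at epoch 12; the fine-tuning stage described above would use the same shape with `base_lr=4e-5`.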