SAM-CP: Marrying SAM with Composable Prompts for Versatile Segmentation
Authors: Pengfei Chen, Lingxi Xie, Xinyue Huo, Xuehui Yu, Xiaopeng Zhang, Yingfei Sun, Zhenjun Han, Qi Tian
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments show that SAM-CP achieves semantic, instance, and panoptic segmentation in both open and closed domains. In particular, it achieves the state-of-the-art performance in open-vocabulary segmentation. Our research offers a novel and generalized methodology for equipping vision foundation models like SAM with multi-grained semantic perception abilities. Codes are released on github.com/ucas-vg/SAM-CP. (...) We train SAM-CP on COCO Lin & Maire (2014) and ADE20K Zhou et al. (2017) and evaluate it on these two datasets as well as Cityscapes Cordts et al. (2016) for both open-vocabulary and closed-domain segmentation. (...) Extensive experiments demonstrate SAM-CP's ability to cover semantic, instance, and panoptic segmentation with a single model. |
| Researcher Affiliation | Collaboration | Pengfei Chen1,2 Lingxi Xie2 Xinyue Huo2 Xuehui Yu1 Xiaopeng Zhang2 Yingfei Sun1 Zhenjun Han1 Qi Tian2 1 University of Chinese Academy of Sciences 2 Huawei Inc. |
| Pseudocode | Yes | Algorithm 1 Affinity Similarity Calculation. Input: Query vectors Q, Patch features K, Head number η, Stage number ω. Output: Affinity similarity Â. Note: Q ∈ R^{M×D}, K ∈ R^{N×D}, where M and N are the numbers of Q and K, and D is the feature dimension, which is a multiple of η. s ∈ R^{1}, b0 ∈ R^{D}, and b1 ∈ R^{D} are the learnable scaling factor and bias parameters used to initialize the score to 0.01 for the focal loss. |
| Open Source Code | Yes | Codes are released on github.com/ucas-vg/SAM-CP. |
| Open Datasets | Yes | The datasets are public datasets; their links are provided here: COCO: https://cocodataset.org/ ADE20K: http://groups.csail.mit.edu/vision/datasets/ADE20K/ Cityscapes: https://www.cityscapes-dataset.com/ |
| Dataset Splits | Yes | We train SAM-CP on the COCO-Panoptic Lin & Maire (2014) and ADE20K Zhou et al. (2017) datasets (...) COCO-Panoptic (the 2017 version) has 118K training and 5K validation images with 80 thing and 53 stuff categories. |
| Hardware Specification | Yes | We use 8 Tesla-V100 GPUs (4/2 images per GPU) for open-vocabulary/closed-domain experiments. (...) All experiments were carried out using an RTX 4090. |
| Software Dependencies | Yes | We use the implementation of the MMDetection Chen et al. (2019) (v3.0) library. |
| Experiment Setup | Yes | We use loss weights of 2.0, 1.0, and 1.0 for Lcls, Lmfl, and Ldice, and metric weights of 2.0, 1.0, 1.0, 1.0, and 1.0 for the cls, mfl, dice, bbox, and giou terms in the Hungarian matching algorithm. The batch size is 32, where we use 8 Tesla-V100 GPUs (4 images per GPU) for the R50 experiments and 32 GPUs (1 image per GPU) for the Swin-L experiments. The learning rate is set to be 10⁻⁵ at the beginning and is multiplied by 0.1 after the 8/16/40-th epoch when the total number of epochs is 12/24/50. The threshold τ in Section 3.2.3 is set to be 0.8. The number of patch encoders and unified affinity decoders are both set to be 6. |
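The affinity-similarity pseudocode quoted in the table can be sketched in plain NumPy. This is an illustrative reconstruction under stated assumptions, not the authors' implementation: the per-head dot product, the averaging over heads, and folding the learnable parameters into a single scale `s` and a bias chosen so the initial sigmoid score is ≈ 0.01 (as the note requires for focal loss) are all assumptions; the paper's `b0`, `b1` are per-dimension biases that this sketch collapses to a scalar.

```python
import numpy as np

def affinity_similarity(Q, K, num_heads, s=1.0):
    """Sketch of multi-head affinity similarity between queries and patches.

    Q: (M, D) query vectors; K: (N, D) patch features.
    D must be a multiple of num_heads (η in the pseudocode).
    Returns Â of shape (M, N) with entries in (0, 1).
    """
    M, D = Q.shape
    N, _ = K.shape
    assert D % num_heads == 0, "feature dim must be a multiple of the head number"
    d = D // num_heads

    # Split features into heads: (M, η, d) and (N, η, d).
    Qh = Q.reshape(M, num_heads, d)
    Kh = K.reshape(N, num_heads, d)

    # Per-head scaled dot products, then average over heads -> (M, N).
    sim = np.einsum("mhd,nhd->hmn", Qh, Kh) / np.sqrt(d)
    A = sim.mean(axis=0)

    # Scale and bias the logits so sigmoid(score) starts near 0.01,
    # the initialization the pseudocode's note attributes to s, b0, b1.
    b = np.log(0.01 / 0.99)
    return 1.0 / (1.0 + np.exp(-(s * A + b)))
```

With `s = 0` the output is exactly the 0.01 initialization, which is one way to check that the bias term behaves as the note describes.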
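The experiment-setup row packs loss weights, Hungarian matching costs, and a step learning-rate schedule into one sentence; a small config sketch makes the schedule concrete. The dict names and the strict-`>` reading of "after the m-th epoch" are assumptions, not the authors' code.

```python
# Hypothetical config mirroring the quoted setup (key names are assumptions).
LOSS_WEIGHTS = {"cls": 2.0, "mfl": 1.0, "dice": 1.0}
MATCH_COSTS = {"cls": 2.0, "mfl": 1.0, "dice": 1.0, "bbox": 1.0, "giou": 1.0}

def lr_at_epoch(epoch: int, total_epochs: int, base_lr: float = 1e-5) -> float:
    """Step schedule from the setup row: the LR starts at 1e-5 and is
    multiplied by 0.1 after the 8/16/40-th epoch for 12/24/50-epoch runs.

    Epochs are taken as 1-indexed; treating "after the m-th epoch" as
    epoch > m is an assumption about the boundary.
    """
    milestones = {12: 8, 24: 16, 50: 40}
    m = milestones[total_epochs]
    return base_lr * 0.1 if epoch > m else base_lr
```

For example, a 12-epoch run would train at 1e-5 through epoch 8 and at 1e-6 from epoch 9 onward.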