RMP-SAM: Towards Real-Time Multi-Purpose Segment Anything

Authors: Shilin Xu, Haobo Yuan, Qingyu Shi, Lu Qi, Jingbo Wang, Yibo Yang, Yining Li, Kai Chen, Yunhai Tong, Bernard Ghanem, Xiangtai Li, Ming-Hsuan Yang

ICLR 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments demonstrate that RMP-SAM is effective and generalizes well on proposed benchmarks and other specific semantic tasks. ... Our benchmark uses COCO (Lin et al., 2014) and YouTube-VIS 2019 (Yang et al., 2019) datasets for benchmarking. ... Evaluation Metrics and Devices. For image segmentation, we adopt panoptic quality (PQ)... For the VIS task, we use mAP as our primary metric. ... Ablation on Meta-Architecture Design. In Tab. 6(a), we further explore the meta-architecture design, as shown in Fig. 3.
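The panoptic quality (PQ) metric quoted above has a standard definition (Kirillov et al., 2019): the sum of IoUs over matched (true-positive) segment pairs divided by |TP| + ½|FP| + ½|FN|. A minimal sketch, assuming segment matching has already been done at the usual IoU > 0.5 rule; the function name and inputs are illustrative, not from the paper:

```python
def panoptic_quality(matched_ious: list[float], num_fp: int, num_fn: int) -> float:
    """Standard PQ: sum of IoUs over TP matches / (|TP| + 0.5*|FP| + 0.5*|FN|).

    `matched_ious` holds one IoU per matched predicted/ground-truth segment pair
    (each assumed to satisfy the IoU > 0.5 matching rule).
    """
    tp = len(matched_ious)
    denom = tp + 0.5 * num_fp + 0.5 * num_fn
    return sum(matched_ious) / denom if denom else 0.0


# Two matched segments (IoUs 0.8 and 0.6), one unmatched prediction, one missed GT.
pq = panoptic_quality([0.8, 0.6], num_fp=1, num_fn=1)
```

PQ factors into segmentation quality (mean IoU of matches) times recognition quality (an F1-style detection term), which is why a single scalar can penalize both poor masks and missed instances.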
Researcher Affiliation | Collaboration | 1Peking University, 2Nanyang Technological University, 3UC Merced, 4Shanghai AI Laboratory, 5KAUST, 6Google Research
Pseudocode | No | The paper describes methods using prose and mathematical equations but does not contain explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | The code is released at https://github.com/xushilin1/RAP-SAM
Open Datasets | Yes | Datasets. Our benchmark uses COCO (Lin et al., 2014) and YouTube-VIS 2019 (Yang et al., 2019) datasets for benchmarking. ... In addition, to verify the effectiveness and generality of RMP-SAM, we also use other datasets, including ADE-20K (Zhou et al., 2017) and the VIPSeg dataset (Miao et al., 2022) in Sec. 4.1.
Dataset Splits | Yes | Datasets. Our benchmark uses COCO (Lin et al., 2014) and YouTube-VIS 2019 (Yang et al., 2019) datasets for benchmarking. ... For benchmarking, we use the well-known COCO dataset (Lin et al., 2014) for panoptic and interactive segmentation. ... For video segmentation, we adopt the widely used YouTube-VIS 2019 dataset (Yang et al., 2019) for training. ... We further explore the impact of each dataset in Tab. 6(b). We find joint co-training with image and video data leads to better performance for video instance segmentation but reduces the performance of panoptic segmentation. ... For joint dataset training, we sample clip images with a ratio of 1:25:1 for COCO-panoptic, YouTube-VIS 2019, and SAM datasets.
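The quoted 1:25:1 joint-training ratio amounts to weighted sampling over the three data sources, with each mini-batch drawn from a single source (the paper states each batch contains one data type). A minimal stdlib sketch of such a sampler; the dataset keys and function are illustrative assumptions, not the authors' implementation:

```python
import random

# Reported sampling ratio for joint dataset training (COCO-panoptic : YouTube-VIS 2019 : SAM).
RATIOS = {"coco_panoptic": 1, "youtube_vis_2019": 25, "sam": 1}


def sample_source(rng: random.Random) -> str:
    """Pick which dataset the next mini-batch is drawn from, proportional to RATIOS."""
    names = list(RATIOS)
    weights = [RATIOS[n] for n in names]
    return rng.choices(names, weights=weights, k=1)[0]


# Quick empirical check: over many draws the empirical mix approaches 1:25:1.
rng = random.Random(0)
counts = {name: 0 for name in RATIOS}
for _ in range(27_000):
    counts[sample_source(rng)] += 1
```

In a real training loop, the chosen source would determine which dataloader supplies the next batch, so video-heavy sampling (the 25 share) dominates, consistent with the reported trade-off between VIS and panoptic performance.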
Hardware Specification | Yes | Speed testing is conducted fairly on one A100 GPU for all models. ... We use the distributed training framework with 16 A100 GPUs. ... The FPS is obtained on one 40GB A100 GPU for the main paper and supplementary. ... We have evaluated several methods from Table 2 across multiple GPU platforms, including A100-40G, A10-22G, and 3090-24G.
Software Dependencies | No | We implement our models and all other baselines in PyTorch (Paszke et al., 2019). We utilize the AdamW (Loshchilov & Hutter, 2019) optimizer with a weight decay of 0.05, and the learning rate is set to 1e-4 for all methods.
Experiment Setup | Yes | Implementation Details. We implement our models and all other baselines in PyTorch (Paszke et al., 2019). We use the distributed training framework with 16 A100 GPUs. Each mini-batch has two images per GPU, and each batch contains one data type. In particular, we adopt pseudo-video training on COCO by moving image masks in random directions. All models are trained for 12 epochs. For data augmentation, we adopt large-scale jitter as in previous works (Cheng et al., 2022) to build strong baselines. For all models, we adopt the same training steps and optimizers. Refer to the appendix (Sec. 7.1) for more details. ... We utilize the AdamW (Loshchilov & Hutter, 2019) optimizer with a weight decay of 0.05, and the learning rate is set to 1e-4 for all methods. We warm up the learning rate in the first 500 iterations using a linearly increasing strategy and decay the learning rate at epochs 8 and 11 by a factor of 10. ... By default, we set λcls = 2, λce = 5, λdice = 5.
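The quoted schedule (base LR 1e-4, linear warmup over the first 500 iterations, ×0.1 decay at epochs 8 and 11) can be written as a per-iteration LR function. A minimal sketch, assuming a hypothetical ITERS_PER_EPOCH (the true value depends on dataset size and batch size); in PyTorch this multiplier would typically be wrapped in torch.optim.lr_scheduler.LambdaLR alongside AdamW with weight decay 0.05:

```python
# Reported hyperparameters from the paper's setup.
BASE_LR = 1e-4
WARMUP_ITERS = 500          # linear warmup over the first 500 iterations
DECAY_EPOCHS = (8, 11)      # decay by a factor of 10 at epochs 8 and 11
ITERS_PER_EPOCH = 1_000     # assumed value, for illustration only


def learning_rate(it: int) -> float:
    """Learning rate at training iteration `it` (0-indexed)."""
    if it < WARMUP_ITERS:
        # Linearly increase toward the base learning rate during warmup.
        return BASE_LR * (it + 1) / WARMUP_ITERS
    epoch = it // ITERS_PER_EPOCH
    # Multiply by 0.1 for each decay milestone already reached.
    return BASE_LR * 0.1 ** sum(epoch >= e for e in DECAY_EPOCHS)
```

With these assumed values, the LR ramps to 1e-4 by iteration 500, drops to 1e-5 at the start of epoch 8, and to 1e-6 at epoch 11, matching the quoted 12-epoch schedule.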