TinySAM: Pushing the Envelope for Efficient Segment Anything Model

Authors: Han Shu, Wenshuo Li, Yehui Tang, Yiman Zhang, Yihao Chen, Houqiang Li, Yunhe Wang, Xinghao Chen

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Extensive experiments on various zero-shot transfer tasks demonstrate the significantly advantageous performance of our TinySAM against counterpart methods." "We utilize the TinyViT-5M (Wu et al. 2022) as the lightweight student image encoder and SAM-H as the teacher model, following prior work (Zhang et al. 2023). 1% of the SA-1B dataset is used as the training data for full-stage distillation."
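The teacher–student setup quoted above (SAM-H distilling into a TinyViT-5M encoder) can be sketched with a generic feature-distillation objective. This is a minimal stand-in, not the paper's actual loss: the full-stage distillation in the paper also supervises downstream mask outputs and uses hard-mask weighting, which are omitted here.

```python
import numpy as np

def distillation_loss(student_emb, teacher_emb):
    """Mean-squared error between student and teacher image embeddings.

    A generic feature-distillation objective: the student encoder is
    trained to reproduce the (frozen) teacher's embedding for the same
    image. The paper's full-stage distillation adds further loss terms
    on mask outputs, which this sketch does not model.
    """
    student_emb = np.asarray(student_emb, dtype=np.float64)
    teacher_emb = np.asarray(teacher_emb, dtype=np.float64)
    return float(np.mean((student_emb - teacher_emb) ** 2))
```

In a training loop, this loss would be computed per batch on embeddings produced by the student (TinyViT-5M) and teacher (SAM-H) encoders, with gradients flowing only through the student.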
Researcher Affiliation | Collaboration | "1 University of Science and Technology of China, 2 Huawei Noah's Ark Lab"
Pseudocode | No | "The paper describes methods using equations and figures, but does not contain a clearly labeled pseudocode or algorithm block with structured steps."
Open Source Code | Yes | "Code: https://github.com/xinghaochen/TinySAM"
Open Datasets | Yes | "Together with the proposed SA-1B dataset, which contains 11 million high-resolution images and more than 1 billion high-quality segmentation masks, SAM shows impressive high-quality segmentation ability for objects of any category and shape." "We evaluate the zero-shot instance segmentation task for models on the benchmark of the COCO (Lin et al. 2014) dataset and LVIS v1 (Gupta, Dollar, and Girshick 2019)." "We choose a subset of the total 23 datasets used in (Kirillov et al. 2023) for efficient evaluation, which contains BBBC038v1 (Caicedo et al. 2019), DOORS (Pugliatti and Topputo 2022), TimberSeg (Fortin et al. 2022) and LVIS (Gupta, Dollar, and Girshick 2019)."
Dataset Splits | Yes | "1% of the SA-1B dataset is used as the training data for full-stage distillation." "We evaluate the zero-shot instance segmentation task for models on the benchmark of the COCO (Lin et al. 2014) dataset and LVIS v1 (Gupta, Dollar, and Girshick 2019)." "To make a fair comparison, we follow the settings of SAM (Kirillov et al. 2023) to sample the images and masks, and the first N masks in the corresponding split are used in the evaluation." "Evaluation on the first 100 images of the COCO val2017 set."
Hardware Specification | Yes | "The latency is tested with TensorRT on an NVIDIA T4 GPU." "The latency is tested on an NVIDIA T4 GPU." "Latency benchmarks are conducted on a single NVIDIA V100 GPU for everything mode."
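The latency measurements quoted above follow the standard warm-up-then-average protocol. A minimal, hardware-agnostic sketch of such a benchmark is below; it is an assumed harness, not the paper's TensorRT pipeline, and on a GPU an explicit device synchronization would be required after each invocation for the timings to be meaningful.

```python
import time

def measure_latency_ms(fn, warmup=10, iters=100):
    """Average wall-clock latency of `fn` in milliseconds.

    Warm-up iterations are discarded so one-time costs (memory
    allocation, kernel/JIT compilation, cache warming) do not skew
    the average. Note: this times host-side wall clock only; GPU
    benchmarks additionally need a synchronization barrier per call.
    """
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    elapsed = time.perf_counter() - start
    return elapsed / iters * 1000.0
```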
Software Dependencies | No | "The paper mentions TensorRT but does not specify a version number. Other software, such as the Adam optimizer, is mentioned without version details for the underlying library or framework."
Experiment Setup | Yes | "We utilize the TinyViT-5M (Wu et al. 2022) as the lightweight student image encoder and SAM-H as the teacher model, following prior work (Zhang et al. 2023). 1% of the SA-1B dataset is used as the training data for full-stage distillation. We adopt the Adam optimizer and train the student network for 8 epochs. For each iteration, we sample 64 prompts according to the hard prompt sampling strategy. For post-training quantization, we set θl = 0.01, θu = 1.2, n = 100, rounds = 3 for the iterative search. We calibrate the quantized model on the SA-1B dataset using 8 images. The threshold values used in the everything mode are all kept the same as default."
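The quoted post-training quantization settings (θl = 0.01, θu = 1.2, n = 100, rounds = 3) describe an iterative search over candidate quantization scales. The sketch below shows one plausible reading: sweep n candidate scales between θl·max|x| and θu·max|x|, keep the scale with the lowest reconstruction error, and narrow the sweep around it for each subsequent round. The L2 objective and the narrowing schedule are assumptions; the paper's exact search criterion is not quoted here.

```python
import numpy as np

def search_quant_scale(x, n_bits=8, theta_l=0.01, theta_u=1.2, n=100, rounds=3):
    """Iterative grid search for a symmetric quantization scale.

    Candidate scales sweep [theta_l, theta_u] * max|x| with n points;
    each round re-centers a narrower sweep on the current best scale.
    Hyperparameter names mirror those reported in the paper, but the
    L2 reconstruction-error objective is an assumed choice.
    """
    x = np.asarray(x, dtype=np.float64)
    qmax = 2 ** (n_bits - 1) - 1          # e.g. 127 for 8-bit
    lo = theta_l * np.abs(x).max()
    hi = theta_u * np.abs(x).max()
    best_scale, best_err = hi, np.inf
    for _ in range(rounds):
        for scale in np.linspace(lo, hi, n):
            # quantize, dequantize, and score the round-trip error
            q = np.clip(np.round(x / scale * qmax), -qmax - 1, qmax)
            err = np.sum((q * scale / qmax - x) ** 2)
            if err < best_err:
                best_err, best_scale = err, scale
        # narrow the sweep to one grid step around the best candidate
        step = (hi - lo) / n
        lo = max(best_scale - step, 1e-12)
        hi = best_scale + step
    return best_scale
```

Calibration would run this search per tensor on activations collected from the small calibration set (8 SA-1B images in the quoted setup).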