SAM 2: Segment Anything in Images and Videos

Authors: Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junting Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao-Yuan Wu, Ross Girshick, Piotr Dollár, Christoph Feichtenhofer

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We present Segment Anything Model 2 (SAM 2), a foundation model towards solving promptable visual segmentation in images and videos. We build a data engine, which improves model and data via user interaction, to collect the largest video segmentation dataset to date. Our model is a simple transformer architecture with streaming memory for real-time video processing. SAM 2 trained on our data provides strong performance across a wide range of tasks. In video segmentation, we observe better accuracy, using 3× fewer interactions than prior approaches. In image segmentation, our model is more accurate and 6× faster than the Segment Anything Model (SAM). We believe that our data, model, and insights will serve as a significant milestone for video segmentation and related perception tasks. We are releasing our main model, the dataset, an interactive demo and code. Our experiments (§6) show that SAM 2 delivers a step-change in the video segmentation experience. SAM 2 can produce better segmentation accuracy while using 3× fewer interactions than prior approaches. Further, SAM 2 outperforms prior work in established video object segmentation benchmarks, under multiple evaluation settings, and delivers better performance compared to SAM on image segmentation benchmarks, while being 6× faster. SAM 2 is shown to be effective across a variety of video and image distributions as observed through numerous zero-shot benchmarks including 17 for video segmentation and 37 for single-image segmentation.
Researcher Affiliation | Industry | Nikhila Ravi*, Valentin Gabeur*, Yuan-Ting Hu*, Ronghang Hu*, Chaitanya Ryali*, Tengyu Ma*, Haitham Khedr*, Roman Rädle*, Chloe Rolland, Laura Gustafson, Eric Mintun, Junting Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao-Yuan Wu, Ross Girshick, Piotr Dollár, Christoph Feichtenhofer*, Meta FAIR, https://github.com/facebookresearch/sam2
Pseudocode | No | The paper describes the model architecture and training process in detail, often with figures (e.g., Figure 3 for the SAM 2 architecture, Figure 8 for the mask decoder architecture), but it does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks or sections with structured code-like steps.
Open Source Code | Yes | We are releasing our main model, the dataset, an interactive demo and code. We are releasing our work under permissive open licences, including the SA-V dataset, the SAM 2 model checkpoints, training code, and code for an interactive web demo. Repository: https://github.com/facebookresearch/sam2
Open Datasets | Yes | We are releasing our work under permissive open licences, including the SA-V dataset, the SAM 2 model checkpoints, training code, and code for an interactive web demo. Our final Segment Anything Video (SA-V) dataset (§5.2) consists of 35.5M masks across 50.9K videos, 53× more masks than any existing video segmentation dataset. We are releasing SA-V under a permissive license. The dataset is available under a Creative Commons Attribution 4.0 International Public License at https://ai.meta.com/datasets/segment-anything-video/.
Dataset Splits | Yes | SA-V training, validation and test splits. We split SA-V based on the video authors (and their geographic locations) to ensure minimal overlap of similar objects. To create SA-V val and SA-V test sets, we focus on challenging scenarios in selecting videos, and ask annotators to identify challenging targets that are fast-moving, have complex occlusions by other objects as well as disappearance/reappearance patterns. These targets were annotated at 6 FPS using the data engine Phase 1 setup in §5.1. There are 293 masklets and 155 videos in the SA-V val split, and 278 masklets and 150 videos in the SA-V test split.
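The author-based splitting described above can be sketched as follows. This is a minimal illustration of the idea, not the paper's actual split procedure: function names, split fractions, and the use of a seeded shuffle are assumptions; the key property it demonstrates is that every video from a given author lands in exactly one split, so near-duplicate objects cannot leak between train, val, and test.

```python
import random

def split_by_author(videos, val_frac=0.05, test_frac=0.05, seed=0):
    """Assign videos to train/val/test so no author spans two splits.

    videos: list of (video_id, author_id) pairs.
    val_frac/test_frac: fraction of *authors* (illustrative values,
    not from the paper) routed to the val and test splits.
    """
    authors = sorted({author for _, author in videos})
    rng = random.Random(seed)
    rng.shuffle(authors)

    n_val = max(1, int(len(authors) * val_frac))
    n_test = max(1, int(len(authors) * test_frac))
    val_authors = set(authors[:n_val])
    test_authors = set(authors[n_val:n_val + n_test])

    splits = {"train": [], "val": [], "test": []}
    for video_id, author in videos:
        if author in val_authors:
            splits["val"].append(video_id)
        elif author in test_authors:
            splits["test"].append(video_id)
        else:
            splits["train"].append(video_id)
    return splits
```

Because the partition is over authors rather than individual videos, the disjointness property holds by construction regardless of how many videos each author contributed.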
Hardware Specification | Yes | We conduct all benchmarking experiments on a single A100 GPU using PyTorch 2.3.1 and CUDA 12.1, under automatic mixed precision with bfloat16. ...to fit the 16-frame sequence into the 80 GB memory of A100 GPUs.
Software Dependencies | Yes | We conduct all benchmarking experiments on a single A100 GPU using PyTorch 2.3.1 and CUDA 12.1, under automatic mixed precision with bfloat16.
Experiment Setup | Yes | We use AdamW (Loshchilov & Hutter, 2019) and apply layer decay (Clark et al., 2020) on the image encoder and follow a reciprocal square-root schedule (Zhai et al., 2022). See Table 12(a) for the hyperparameters in our pre-training stage. After pre-training, we train SAM 2 on our introduced datasets SA-V + Internal (§5.2), a 10% subset of SA-1B, and a mixture of open-source video datasets including DAVIS (Pont-Tuset et al., 2017; Caelles et al., 2019), MOSE (Ding et al., 2023), and YouTube-VOS (Xu et al., 2018b). ...The training data mixture consists of 15.2% SA-1B, 70% SA-V and 14.8% Internal. ...We train by simulating an interactive setting, sampling 8-frame sequences and randomly selecting up to 2 frames (including the first) for corrective clicks. During training, we use ground-truth masklets and model predictions to sample prompts, with initial prompts being the ground-truth mask (50% probability), a positive click from the ground-truth mask (25%), or a bounding box input (25%).
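The interactive-training setup quoted above can be sketched in a few lines. The 50%/25%/25% initial-prompt probabilities, the 8-frame sequences, and the "up to 2 frames including the first" rule are from the paper; the function names and return representations are illustrative assumptions, standing in for the actual mask, click, and box tensors used in training.

```python
import random

def sample_initial_prompt(rng):
    """Draw the initial prompt type with the probabilities stated above."""
    r = rng.random()
    if r < 0.50:
        return "gt_mask"         # full ground-truth mask as the prompt (50%)
    elif r < 0.75:
        return "positive_click"  # one click sampled inside the GT mask (25%)
    else:
        return "bounding_box"    # box around the GT mask (25%)

def sample_training_sequence(num_frames=8, max_prompted=2, rng=None):
    """Pick which frames of an 8-frame window receive (corrective) prompts.

    The first frame is always prompted; at most `max_prompted` frames
    are prompted in total, matching "up to 2 frames (including the first)".
    """
    rng = rng or random.Random()
    n_prompted = rng.randint(1, max_prompted)
    prompted = {0}
    while len(prompted) < n_prompted:
        prompted.add(rng.randrange(num_frames))
    return sorted(prompted), sample_initial_prompt(rng)
```

In the real training loop these draws would select actual ground-truth masklets and model predictions to build the prompt inputs; the sketch only reproduces the sampling logic.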