SAM 2: Segment Anything in Images and Videos
Authors: Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junting Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao-Yuan Wu, Ross Girshick, Piotr Dollár, Christoph Feichtenhofer
ICLR 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present Segment Anything Model 2 (SAM 2), a foundation model towards solving promptable visual segmentation in images and videos. We build a data engine, which improves model and data via user interaction, to collect the largest video segmentation dataset to date. Our model is a simple transformer architecture with streaming memory for real-time video processing. SAM 2 trained on our data provides strong performance across a wide range of tasks. In video segmentation, we observe better accuracy, using 3× fewer interactions than prior approaches. In image segmentation, our model is more accurate and 6× faster than the Segment Anything Model (SAM). We believe that our data, model, and insights will serve as a significant milestone for video segmentation and related perception tasks. We are releasing our main model, the dataset, an interactive demo and code. Our experiments (§6) show that SAM 2 delivers a step-change in the video segmentation experience. SAM 2 can produce better segmentation accuracy while using 3× fewer interactions than prior approaches. Further, SAM 2 outperforms prior work in established video object segmentation benchmarks, under multiple evaluation settings, and delivers better performance compared to SAM on image segmentation benchmarks, while being 6× faster. SAM 2 is shown to be effective across a variety of video and image distributions as observed through numerous zero-shot benchmarks including 17 for video segmentation and 37 for single-image segmentation. |
| Researcher Affiliation | Industry | Nikhila Ravi*, Valentin Gabeur* Yuan-Ting Hu* Ronghang Hu* Chaitanya Ryali* Tengyu Ma* Haitham Khedr* Roman Rädle* Chloe Rolland Laura Gustafson Eric Mintun Junting Pan Kalyan Vasudev Alwala Nicolas Carion Chao-Yuan Wu Ross Girshick Piotr Dollár Christoph Feichtenhofer*, Meta FAIR, https://github.com/facebookresearch/sam2 |
| Pseudocode | No | The paper describes the model architecture and training process in detail, often with figures (e.g., Figure 3 for SAM 2 architecture, Figure 8 for Mask decoder architecture), but it does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks or sections with structured code-like steps. |
| Open Source Code | Yes | We are releasing our main model, the dataset, an interactive demo and code. We are releasing our work under permissive open licenses, including the SA-V dataset, the SAM 2 model checkpoints, training code, and code for an interactive web demo. Repository: https://github.com/facebookresearch/sam2 |
| Open Datasets | Yes | We are releasing our work under permissive open licenses, including the SA-V dataset, the SAM 2 model checkpoints, training code, and code for an interactive web demo. Our final Segment Anything Video (SA-V) dataset (§5.2) consists of 35.5M masks across 50.9K videos, 53× more masks than any existing video segmentation dataset. We are releasing SA-V under a permissive license. The dataset is available under a Creative Commons Attribution 4.0 International Public License at https://ai.meta.com/datasets/segment-anything-video/. |
| Dataset Splits | Yes | SA-V training, validation and test splits. We split SA-V based on the video authors (and their geographic locations) to ensure minimal overlap of similar objects. To create SA-V val and SA-V test sets, we focus on challenging scenarios in selecting videos, and ask annotators to identify challenging targets that are fast-moving, have complex occlusions by other objects as well as disappearance/reappearance patterns. These targets were annotated at 6 FPS using the data engine Phase 1 setup in §5.1. There are 293 masklets and 155 videos in the SA-V val split, and 278 masklets and 150 videos in the SA-V test split. |
| Hardware Specification | Yes | We conduct all benchmarking experiments on a single A100 GPU using PyTorch 2.3.1 and CUDA 12.1, under automatic mixed precision with bfloat16. ...to fit the 16-frame sequence into the 80 GB memory of A100 GPUs. |
| Software Dependencies | Yes | We conduct all benchmarking experiments on a single A100 GPU using PyTorch 2.3.1 and CUDA 12.1, under automatic mixed precision with bfloat16. |
| Experiment Setup | Yes | We use AdamW (Loshchilov & Hutter, 2019) and apply layer decay (Clark et al., 2020) on the image encoder and follow a reciprocal square-root schedule (Zhai et al., 2022). See Table 12 (a) for the hyperparameters in our pre-training stage. After pre-training, we train SAM 2 on our introduced datasets SA-V + Internal (§5.2), a 10% subset of SA-1B, and a mixture of open-source video datasets including DAVIS (Pont-Tuset et al., 2017; Caelles et al., 2019), MOSE (Ding et al., 2023), and YouTube-VOS (Xu et al., 2018b). ...The training data mixture consists of 15.2% SA-1B, 70% SA-V and 14.8% Internal. ...We train by simulating an interactive setting, sampling 8-frame sequences and randomly selecting up to 2 frames (including the first) for corrective clicks. During training, we use ground-truth masklets and model predictions to sample prompts, with initial prompts being the ground-truth mask (50% probability), a positive click from the ground-truth mask (25%), or a bounding box input (25%). |
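The initial-prompt sampling rule quoted in the Experiment Setup row (ground-truth mask 50%, positive click 25%, bounding box 25%) can be sketched as a small Python snippet. This is an illustrative reconstruction, not code from the SAM 2 repository; the function and label names are hypothetical.

```python
import random

# Hypothetical sketch of the quoted sampling rule: the initial prompt for a
# training sequence is the ground-truth mask (50%), a positive click drawn
# from the ground-truth mask (25%), or a bounding box input (25%).
def sample_initial_prompt(rng: random.Random) -> str:
    r = rng.random()
    if r < 0.50:
        return "gt_mask"
    elif r < 0.75:
        return "positive_click"
    else:
        return "bounding_box"

# Sanity check: empirical frequencies should match the stated probabilities.
rng = random.Random(0)
counts = {"gt_mask": 0, "positive_click": 0, "bounding_box": 0}
n = 100_000
for _ in range(n):
    counts[sample_initial_prompt(rng)] += 1
```

In the paper's described setup this choice is made per training sample, with subsequent corrective clicks (on up to 2 of the 8 frames) derived from the ground-truth masklets and the model's own predictions.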