SAM 2: Segment Anything in Images and Videos

Authors: Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junting Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao-Yuan Wu, Ross Girshick, Piotr Dollár, Christoph Feichtenhofer

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We present Segment Anything Model 2 (SAM 2), a foundation model towards solving promptable visual segmentation in images and videos. We build a data engine, which improves model and data via user interaction, to collect the largest video segmentation dataset to date. Our model is a simple transformer architecture with streaming memory for real-time video processing. SAM 2 trained on our data provides strong performance across a wide range of tasks. In video segmentation, we observe better accuracy, using 3× fewer interactions than prior approaches. In image segmentation, our model is more accurate and 6× faster than the Segment Anything Model (SAM). We believe that our data, model, and insights will serve as a significant milestone for video segmentation and related perception tasks. We are releasing our main model, the dataset, an interactive demo and code. Our experiments (§6) show that SAM 2 delivers a step-change in the video segmentation experience. SAM 2 can produce better segmentation accuracy while using 3× fewer interactions than prior approaches. Further, SAM 2 outperforms prior work in established video object segmentation benchmarks, under multiple evaluation settings, and delivers better performance compared to SAM on image segmentation benchmarks, while being 6× faster. SAM 2 is shown to be effective across a variety of video and image distributions as observed through numerous zero-shot benchmarks including 17 for video segmentation and 37 for single-image segmentation.
Researcher Affiliation | Industry | Nikhila Ravi*, Valentin Gabeur*, Yuan-Ting Hu*, Ronghang Hu*, Chaitanya Ryali*, Tengyu Ma*, Haitham Khedr*, Roman Rädle*, Chloe Rolland, Laura Gustafson, Eric Mintun, Junting Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao-Yuan Wu, Ross Girshick, Piotr Dollár, Christoph Feichtenhofer*, Meta FAIR, https://github.com/facebookresearch/sam2
Pseudocode | No | The paper describes the model architecture and training process in detail, often with figures (e.g., Figure 3 for the SAM 2 architecture, Figure 8 for the mask decoder architecture), but it does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks or sections with structured code-like steps.
Open Source Code | Yes | We are releasing our main model, the dataset, an interactive demo and code. We are releasing our work under permissive open licences, including the SA-V dataset, the SAM 2 model checkpoints, training code, and code for an interactive web demo. Repository: https://github.com/facebookresearch/sam2
Open Datasets | Yes | We are releasing our work under permissive open licences, including the SA-V dataset, the SAM 2 model checkpoints, training code, and code for an interactive web demo. Our final Segment Anything Video (SA-V) dataset (§5.2) consists of 35.5M masks across 50.9K videos, 53× more masks than any existing video segmentation dataset. We are releasing SA-V under a permissive license. The dataset is available under a Creative Commons Attribution 4.0 International Public License at https://ai.meta.com/datasets/segment-anything-video/.
Dataset Splits | Yes | SA-V training, validation and test splits. We split SA-V based on the video authors (and their geographic locations) to ensure minimal overlap of similar objects. To create SA-V val and SA-V test sets, we focus on challenging scenarios in selecting videos, and ask annotators to identify challenging targets that are fast-moving, have complex occlusions by other objects as well as disappearance/reappearance patterns. These targets were annotated at 6 FPS using the data engine Phase 1 setup in §5.1. There are 293 masklets and 155 videos in the SA-V val split, and 278 masklets and 150 videos in the SA-V test split.
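The author-based splitting described above can be sketched as follows. This is a minimal illustration of the idea, not the paper's actual split procedure: function names, split fractions, and the use of a seeded shuffle are assumptions; the key property it demonstrates is that every video from a given author lands in exactly one split, so near-duplicate objects cannot leak between train, val, and test.

```python
import random

def split_by_author(videos, val_frac=0.05, test_frac=0.05, seed=0):
    """Assign videos to train/val/test so no author spans two splits.

    videos: list of (video_id, author_id) pairs.
    val_frac/test_frac: fraction of *authors* (illustrative values,
    not from the paper) routed to the val and test splits.
    """
    authors = sorted({author for _, author in videos})
    rng = random.Random(seed)
    rng.shuffle(authors)

    n_val = max(1, int(len(authors) * val_frac))
    n_test = max(1, int(len(authors) * test_frac))
    val_authors = set(authors[:n_val])
    test_authors = set(authors[n_val:n_val + n_test])

    splits = {"train": [], "val": [], "test": []}
    for video_id, author in videos:
        if author in val_authors:
            splits["val"].append(video_id)
        elif author in test_authors:
            splits["test"].append(video_id)
        else:
            splits["train"].append(video_id)
    return splits
```

Because the partition is over authors rather than individual videos, the disjointness property holds by construction regardless of how many videos each author contributed.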
Hardware Specification | Yes | We conduct all benchmarking experiments on a single A100 GPU using PyTorch 2.3.1 and CUDA 12.1, under automatic mixed precision with bfloat16. ...to fit the 16-frame sequence into the 80 GB memory of A100 GPUs.
Software Dependencies | Yes | We conduct all benchmarking experiments on a single A100 GPU using PyTorch 2.3.1 and CUDA 12.1, under automatic mixed precision with bfloat16.
Experiment Setup | Yes | We use AdamW (Loshchilov & Hutter, 2019) and apply layer decay (Clark et al., 2020) on the image encoder and follow a reciprocal square-root schedule (Zhai et al., 2022). See Table 12(a) for the hyperparameters in our pre-training stage. After pre-training, we train SAM 2 on our introduced datasets SA-V + Internal (§5.2), a 10% subset of SA-1B, and a mixture of open-source video datasets including DAVIS (Pont-Tuset et al., 2017; Caelles et al., 2019), MOSE (Ding et al., 2023), and YouTube-VOS (Xu et al., 2018b). ...The training data mixture consists of 15.2% SA-1B, 70% SA-V and 14.8% Internal. ...We train by simulating an interactive setting, sampling 8-frame sequences and randomly selecting up to 2 frames (including the first) for corrective clicks. During training, we use ground-truth masklets and model predictions to sample prompts, with initial prompts being the ground-truth mask (50% probability), a positive click from the ground-truth mask (25%), or a bounding box input (25%).
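The interactive-training setup quoted above can be sketched in a few lines. The 50%/25%/25% initial-prompt probabilities, the 8-frame sequences, and the "up to 2 frames including the first" rule are from the paper; the function names and return representations are illustrative assumptions, standing in for the actual mask, click, and box tensors used in training.

```python
import random

def sample_initial_prompt(rng):
    """Draw the initial prompt type with the probabilities stated above."""
    r = rng.random()
    if r < 0.50:
        return "gt_mask"         # full ground-truth mask as the prompt (50%)
    elif r < 0.75:
        return "positive_click"  # one click sampled inside the GT mask (25%)
    else:
        return "bounding_box"    # box around the GT mask (25%)

def sample_training_sequence(num_frames=8, max_prompted=2, rng=None):
    """Pick which frames of an 8-frame window receive (corrective) prompts.

    The first frame is always prompted; at most `max_prompted` frames
    are prompted in total, matching "up to 2 frames (including the first)".
    """
    rng = rng or random.Random()
    n_prompted = rng.randint(1, max_prompted)
    prompted = {0}
    while len(prompted) < n_prompted:
        prompted.add(rng.randrange(num_frames))
    return sorted(prompted), sample_initial_prompt(rng)
```

In the real training loop these draws would select actual ground-truth masklets and model predictions to build the prompt inputs; the sketch only reproduces the sampling logic.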