Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline and validated against a manually labeled dataset. LLM-based classification introduces uncertainty, so scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

SMITE: Segment Me In TimE

Authors: Amirhossein Alimohammadi, Sauradip Nag, Saeid Asgari, Andrea Tagliasacchi, Ghassan Hamarneh, Ali Mahdavi Amiri

ICLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type: Experimental
LLM Response: To validate our design choices and methodology through comprehensive experiments detailed in the paper. As existing datasets with arbitrary semantic granularity are lacking, we introduce a small dataset, SMITE-50, to demonstrate the superior performance of our method against baselines. Additionally, we conduct user studies that highlight our method's effectiveness in terms of segmentation accuracy and temporal consistency. [...] 5 RESULTS AND EXPERIMENTS Dataset and benchmark. To evaluate our method, we introduce a benchmark dataset called SMITE-50, primarily sourced from Pexels. SMITE-50 features multi-granularity annotations and includes visually challenging scenarios such as pose changes and occlusions. [...] Quantitative comparison. Since there are no available few-shot video part segmentation methods that can be applied to our setting, we build multiple baselines for quantitative comparison. [...] Ablation study. To demonstrate the effectiveness of our design choices, we conducted a comprehensive ablation study showing the impact of each component of our method, one by one, on the entire SMITE-50 dataset. Tab. 2 shows that our final setting provides the highest accuracy and that the impact of each component is noticeable.
Researcher Affiliation: Collaboration
LLM Response: Amirhossein Alimohammadi1, Sauradip Nag1, Saeid Asgari Taghanaki1,2, Andrea Tagliasacchi1,3,4, Ghassan Hamarneh1, Ali Mahdavi Amiri1. 1Simon Fraser University; 2Autodesk Research; 3University of Toronto; 4Google DeepMind
Pseudocode: Yes
LLM Response: Algorithm 1 Temporal Voting
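The paper's Algorithm 1 (Temporal Voting) is only named in the response, not reproduced. A generic majority-vote sketch conveys the idea of smoothing each pixel's label across a temporal window; the function name, the flat per-frame label lists, and the default window size here are illustrative assumptions, not the authors' implementation (the paper reports window sizes of 7 for horses and 15 for other categories):

```python
from collections import Counter

def temporal_vote(label_stack, window=7):
    """Majority-vote each pixel's label across a sliding temporal window.

    label_stack: list of frames, each frame a flat list of per-pixel labels.
    window: temporal window size (hypothetical default; the paper uses
            7 for fast-moving objects like horses, 15 otherwise).
    Returns a new list of frames with temporally smoothed labels.
    """
    half = window // 2
    n = len(label_stack)
    voted = []
    for t in range(n):
        # Clamp the window at video boundaries.
        lo, hi = max(0, t - half), min(n, t + half + 1)
        frame = []
        for p in range(len(label_stack[t])):
            votes = Counter(label_stack[f][p] for f in range(lo, hi))
            frame.append(votes.most_common(1)[0][0])
        voted.append(frame)
    return voted
```

With `window=3`, a single-frame label flip (e.g. a pixel labeled 2 between two frames labeled 1) is voted back to the majority label, which is the temporal-consistency effect the paper's voting step targets.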
Open Source Code: No
LLM Response: The project page is available at https://segment-me-in-time.github.io/. [...] To better show the capabilities of our method, we also collected a flexible granularity dataset called SMITE-50 that will be publicly available along with our code.
Open Datasets: Yes
LLM Response: As existing datasets with arbitrary semantic granularity are lacking, we introduce a small dataset, SMITE-50, to demonstrate the superior performance of our method against baselines. [...] While the Puma VOS dataset (Bekuzarov et al., 2023) contains a limited number of multi-part annotated videos across a diverse range, it does not offer multiple videos for specific granularities and categories. [...] We evaluate our dataset on DAVIS-2016 for the video object segmentation (VOS) task using two different settings: a) Zero-Shot VOS in Table 9 and b) Semi-Supervised VOS in Table 10, respectively.
Dataset Splits: Yes
LLM Response: We focus on three main categories: (a) Horses, (b) Human Faces, and (c) Cars, encompassing 41 videos. Each subset includes ten segmented reference images for training and densely annotated videos for testing. The granularity varies from human eyes to animal heads, etc., relevant for various applications such as VFX (see Fig. 5). All segments are labeled consistently with the part names used in existing datasets. Additionally, we provide nine challenging videos featuring faces with segments that cannot be described textually, as shown in Fig. 3 (Non-Text). Overall, our dataset comprises 50 video clips, each at least five seconds long. For dense annotations, we followed a similar approach to (Ding et al., 2023; Bekuzarov et al., 2023), creating masks for every fifth frame with an average of six parts per frame across three granularity types (more info in Appendix). While the Puma VOS dataset (Bekuzarov et al., 2023) has 8% annotations, our SMITE-50 dataset has 20% dense annotations.
Hardware Specification: Yes
LLM Response: All experiments were conducted on a single NVIDIA RTX 3090 GPU. During the learning phase, we initially optimized only the text embeddings for the first 100 iterations. For the subsequent iterations, we optimized the cross-attention to_k and to_v parameters. [...] Using 10 training images with a batch size of 1 on 15 GB of GPU VRAM, training takes 20 minutes. By increasing the batch size and utilizing more GPU VRAM, training time can be reduced to 7 minutes. [...] For inference, each frame takes 26 seconds and requires 60 GB of GPU VRAM to process the entire video. However, it is possible to adjust settings to use only 15 GB, though this increases the inference time. [...] These strategies enable our model to process longer videos effectively on GPUs with as little as 24 GB of VRAM.
Software Dependencies: No
LLM Response: The paper mentions using Stable Diffusion (SD) semantic knowledge and PyTorch (implied through typical deep learning libraries but not explicitly stated with a version). No specific software versions are provided for reproducibility.
Experiment Setup: Yes
LLM Response: During the learning phase, we initially optimized only the text embeddings for the first 100 iterations. For the subsequent iterations, we optimized the cross-attention to_k and to_v parameters. [...] For the categories of horses, cars, faces and non-text, we used 10 reference images from our SMITE-50 benchmark for both our method and the baseline comparisons. [...] Regarding the window size in our tracking module, we found that fast-moving objects benefit from a smaller window size to mitigate potential bias. Consequently, we set the window size to 7 for horses and 15 for other categories. [...] During inference, we added noise corresponding to 100 timesteps and performed a single denoising pass when segment tracking and voting were not employed. When using segment tracking and voting, we applied spatio-temporal guidance at each denoising step and conducted backpropagation 15 times per denoising timestep. For the regularization parameters, we set λReg across all experiments. The tracking parameter λTracking was set to 1 for horses, 0.5 for faces, and either 0.2 or 1 for cars. Additionally, we applied a Discrete Cosine Transform (DCT) low-pass filter with a threshold of 0.4. [...] The strategy to accelerate convergence and simplify parameter tuning in the code involves the use of an Adam-like optimization approach that dynamically adapts the learning rate and gradient updates for the latent variables. Specifically, the code implements the first and second moment estimates, denoted as M1 and M2, which accumulate the gradients and squared gradients, respectively. [...] momentum parameters typically set to 0.9 and 0.999, respectively.
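The Adam-like latent update described in the excerpt can be sketched in plain Python. The moment estimates (the M1 and M2 of the text) and the 0.9/0.999 momentum parameters follow the excerpt; the function signature, the learning rate, the flat-list latent representation, and the bias-correction step are illustrative assumptions, not the authors' code:

```python
import math

def adam_like_step(latent, grad, m1, m2, step, lr=0.1,
                   beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam-style update on a flat list of latent values.

    m1 accumulates gradients, m2 accumulates squared gradients
    (the M1 and M2 of the text); beta1/beta2 are the momentum
    parameters (0.9 and 0.999). step counts from 1 and drives the
    bias correction. Returns updated (latent, m1, m2).
    """
    new_latent, new_m1, new_m2 = [], [], []
    for x, g, a, b in zip(latent, grad, m1, m2):
        a = beta1 * a + (1 - beta1) * g          # first moment (M1)
        b = beta2 * b + (1 - beta2) * g * g      # second moment (M2)
        a_hat = a / (1 - beta1 ** step)          # bias-corrected M1
        b_hat = b / (1 - beta2 ** step)          # bias-corrected M2
        x = x - lr * a_hat / (math.sqrt(b_hat) + eps)
        new_latent.append(x)
        new_m1.append(a)
        new_m2.append(b)
    return new_latent, new_m1, new_m2
```

Dividing the first moment by the root of the second normalizes the update magnitude per latent variable, which is the "dynamically adapts the learning rate" behavior the excerpt attributes to the code.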