A Simple Video Segmenter by Tracking Objects Along Axial Trajectories

Authors: Ju He, Qihang Yu, Inkyu Shin, Xueqing Deng, Alan Yuille, Xiaohui Shen, Liang-Chieh Chen

TMLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate Axial-VS based on two different clip-level segmenters on four widely used video segmentation benchmarks to show its generalizability.
Researcher Affiliation | Collaboration | Ju He EMAIL Johns Hopkins University; Qihang Yu EMAIL ByteDance; Inkyu Shin EMAIL Korea Advanced Institute of Science and Technology; Xueqing Deng EMAIL ByteDance; Alan Yuille EMAIL Johns Hopkins University; Xiaohui Shen EMAIL ByteDance; Liang-Chieh Chen EMAIL ByteDance
Pseudocode | No | The paper describes methods with formulations (Eq. 1-6) and figures (Fig. 2, 3, 4, 5, 6) showing architectures and flows, but does not include any clearly labeled 'Pseudocode' or 'Algorithm' blocks.
Open Source Code | Yes | Code and models are available here.
Open Datasets | Yes | Specifically, for video panoptic segmentation (VPS), we build Axial-VS based on Video-kMaX (Shin et al., 2024) and report performance on VIPSeg (Miao et al., 2022). We also build Axial-VS on top of Tube-Link (Li et al., 2023b) for video instance segmentation (VIS) and report the performance on YouTube-VIS 2021 (Yang et al., 2021a), 2022 (Yang et al., 2022), and OVIS (Qi et al., 2022). [...] Concretely, starting with an ImageNet (Russakovsky et al., 2015) pre-trained backbone, we pre-train the kMaX-DeepLab and Multi-Scale Deformable Attention (MSDeformAttn) in our within-clip tracking module on COCO (Lin et al., 2014).
Dataset Splits | Yes | For the near-online setting (i.e., employing the within-clip tracking module), we use a clip size of two and four for VPS and VIS, respectively. For the offline setting (i.e., employing the cross-clip tracking module), we adopt a video length of 24 (i.e., 12 clips) for VPS and 20 (i.e., 5 clips) for VIS. [...] Our near-online Axial-VS is trained on the VIPSeg dataset with a clip size of 2 × 769 × 1345 and a batch size of 32 [...] Additionally, our offline Axial-VS is trained on VIPSeg with a video size of 24 × 769 × 1345 (12 clips, each comprising 2 frames) and a batch size of 16 [...].
Hardware Specification | Yes | Our near-online Axial-VS is trained on the VIPSeg dataset with a clip size of 2 × 769 × 1345 and a batch size of 32, utilizing 16 V100 32G GPUs for 40k iterations. This training regimen spans approximately 13 hours. Additionally, our offline Axial-VS is trained on VIPSeg with a video size of 24 × 769 × 1345 (12 clips, each comprising 2 frames) and a batch size of 16, employing 8 A100 80G GPUs for 15k iterations.
Software Dependencies | No | For VPS experiments, we first reproduce Video-kMaX (Shin et al., 2024) based on the official PyTorch re-implementation of kMaX-DeepLab (Yu et al., 2022b).
Experiment Setup | Yes | Our near-online Axial-VS is trained on the VIPSeg dataset with a clip size of 2 × 769 × 1345 and a batch size of 32, utilizing 16 V100 32G GPUs for 40k iterations. This training regimen spans approximately 13 hours. Additionally, our offline Axial-VS is trained on VIPSeg with a video size of 24 × 769 × 1345 (12 clips, each comprising 2 frames) and a batch size of 16, employing 8 A100 80G GPUs for 15k iterations. [...] Our near-online Axial-VS is trained on YouTube-VIS with a batch size of 8 clips (each containing 4 frames) using 8 V100 32G GPUs for 15k iterations. We adhere to the literature by randomly resizing the shortest edge of each clip to a predetermined size within the range [288, 320, 352, 384, 416, 448, 480, 512]. [...] As a result, the within-clip and cross-clip tracking modules use Nw = 6 and Nc = 4 blocks, respectively, for VIS.
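The quoted setup implies two simple preprocessing steps: splitting a video into fixed-size clips (24 frames → 12 two-frame clips for VPS; 20 frames → 5 four-frame clips for VIS) and randomly rescaling each clip's shortest edge to one of the listed sizes. A minimal sketch of this arithmetic, assuming illustrative helper names (`chunk_into_clips`, `resized_clip_shape`, `SHORT_EDGE_CHOICES`) that are not from the authors' code:

```python
import random

# Shortest-edge targets quoted in the experiment setup.
SHORT_EDGE_CHOICES = [288, 320, 352, 384, 416, 448, 480, 512]

def chunk_into_clips(frames, clip_size):
    """Split a sequence of frame indices into consecutive, non-overlapping clips."""
    return [frames[i:i + clip_size] for i in range(0, len(frames), clip_size)]

def resized_clip_shape(height, width):
    """Return (height, width) after rescaling so the shortest edge matches a
    randomly chosen target from SHORT_EDGE_CHOICES (aspect ratio preserved)."""
    target = random.choice(SHORT_EDGE_CHOICES)
    scale = target / min(height, width)
    return round(height * scale), round(width * scale)

# Offline settings quoted above: 24-frame VPS video with 2-frame clips -> 12
# clips; 20-frame VIS video with 4-frame clips -> 5 clips.
vps_clips = chunk_into_clips(list(range(24)), clip_size=2)  # 12 clips
vis_clips = chunk_into_clips(list(range(20)), clip_size=4)  # 5 clips
```

This is only a sketch of the clip/resize bookkeeping; the actual augmentation and dataloading pipeline lives in the released code.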