A Simple Video Segmenter by Tracking Objects Along Axial Trajectories

Authors: Ju He, Qihang Yu, Inkyu Shin, Xueqing Deng, Alan Yuille, Xiaohui Shen, Liang-Chieh Chen

TMLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate Axial-VS based on two different clip-level segmenters on four widely used video segmentation benchmarks to show its generalizability.
Researcher Affiliation | Collaboration | Ju He EMAIL Johns Hopkins University; Qihang Yu EMAIL ByteDance; Inkyu Shin EMAIL Korea Advanced Institute of Science and Technology; Xueqing Deng EMAIL ByteDance; Alan Yuille EMAIL Johns Hopkins University; Xiaohui Shen EMAIL ByteDance; Liang-Chieh Chen EMAIL ByteDance
Pseudocode | No | The paper describes methods with formulations (Eq. 1-6) and figures (Fig. 2, 3, 4, 5, 6) showing architectures and flows, but does not include any clearly labeled 'Pseudocode' or 'Algorithm' blocks.
Open Source Code | Yes | Code and models are available here.
Open Datasets | Yes | Specifically, for video panoptic segmentation (VPS), we build Axial-VS based on Video-kMaX (Shin et al., 2024) and report performance on VIPSeg (Miao et al., 2022). We also build Axial-VS on top of Tube-Link (Li et al., 2023b) for video instance segmentation (VIS) and report the performance on YouTube-VIS 2021 (Yang et al., 2021a), 2022 (Yang et al., 2022), and OVIS (Qi et al., 2022). [...] Concretely, starting with an ImageNet (Russakovsky et al., 2015) pre-trained backbone, we pre-train the kMaX-DeepLab and Multi-Scale Deformable Attention (MSDeformAttn) in our within-clip tracking module on COCO (Lin et al., 2014).
Dataset Splits | Yes | For the near-online setting (i.e., employing the within-clip tracking module), we use a clip size of two and four for VPS and VIS, respectively. For the offline setting (i.e., employing the cross-clip tracking module), we adopt a video length of 24 (i.e., 12 clips) for VPS and 20 (i.e., 5 clips) for VIS. [...] Our near-online Axial-VS is trained on the VIPSeg dataset with a clip size of 2 × 769 × 1345 and a batch size of 32 [...] Additionally, our offline Axial-VS is trained on VIPSeg with a video size of 24 × 769 × 1345 (12 clips, each comprising 2 frames) and a batch size of 16 [...].
Hardware Specification | Yes | Our near-online Axial-VS is trained on the VIPSeg dataset with a clip size of 2 × 769 × 1345 and a batch size of 32, utilizing 16 V100 32G GPUs for 40k iterations. This training regimen spans approximately 13 hours. Additionally, our offline Axial-VS is trained on VIPSeg with a video size of 24 × 769 × 1345 (12 clips, each comprising 2 frames) and a batch size of 16, employing 8 A100 80G GPUs for 15k iterations.
Software Dependencies | No | For VPS experiments, we first reproduce Video-kMaX (Shin et al., 2024) based on the official PyTorch re-implementation of kMaX-DeepLab (Yu et al., 2022b).
Experiment Setup | Yes | Our near-online Axial-VS is trained on the VIPSeg dataset with a clip size of 2 × 769 × 1345 and a batch size of 32, utilizing 16 V100 32G GPUs for 40k iterations. This training regimen spans approximately 13 hours. Additionally, our offline Axial-VS is trained on VIPSeg with a video size of 24 × 769 × 1345 (12 clips, each comprising 2 frames) and a batch size of 16, employing 8 A100 80G GPUs for 15k iterations. [...] Our near-online Axial-VS is trained on YouTube-VIS with a batch size of 8 clips (each containing 4 frames) using 8 V100 32G GPUs for 15k iterations. We adhere to the literature by randomly resizing the shortest edge of each clip to a predetermined size within the range [288, 320, 352, 384, 416, 448, 480, 512]. [...] As a result, the within-clip and cross-clip tracking modules use Nw = 6 and Nc = 4 blocks, respectively, for VIS.
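The quoted setup implies two simple preprocessing steps: splitting a video into fixed-size clips (24 frames → 12 two-frame clips for VPS; 20 frames → 5 four-frame clips for VIS) and randomly rescaling each clip's shortest edge to one of the listed sizes. A minimal sketch of this arithmetic, assuming illustrative helper names (`chunk_into_clips`, `resized_clip_shape`, `SHORT_EDGE_CHOICES`) that are not from the authors' code:

```python
import random

# Shortest-edge targets quoted in the experiment setup.
SHORT_EDGE_CHOICES = [288, 320, 352, 384, 416, 448, 480, 512]

def chunk_into_clips(frames, clip_size):
    """Split a sequence of frame indices into consecutive, non-overlapping clips."""
    return [frames[i:i + clip_size] for i in range(0, len(frames), clip_size)]

def resized_clip_shape(height, width):
    """Return (height, width) after rescaling so the shortest edge matches a
    randomly chosen target from SHORT_EDGE_CHOICES (aspect ratio preserved)."""
    target = random.choice(SHORT_EDGE_CHOICES)
    scale = target / min(height, width)
    return round(height * scale), round(width * scale)

# Offline settings quoted above: 24-frame VPS video with 2-frame clips -> 12
# clips; 20-frame VIS video with 4-frame clips -> 5 clips.
vps_clips = chunk_into_clips(list(range(24)), clip_size=2)  # 12 clips
vis_clips = chunk_into_clips(list(range(20)), clip_size=4)  # 5 clips
```

This is only a sketch of the clip/resize bookkeeping; the actual augmentation and dataloading pipeline lives in the released code.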