OpenVIS: Open-vocabulary Video Instance Segmentation

Authors: Pinxue Guo, Hao Huang, Peiyang He, Xuefeng Liu, Tianjun Xiao, Wenqiang Zhang

AAAI 2025

Reproducibility Variable Result LLM Response
Research Type Experimental The experimental results demonstrate that the proposed InstFormer achieves state-of-the-art capabilities on a comprehensive OpenVIS evaluation benchmark, while also achieving competitive performance on the fully supervised VIS task.
Researcher Affiliation Collaboration Pinxue Guo (1,2*), Hao Huang (2), Peiyang He (2), Xuefeng Liu (2), Tianjun Xiao (2), Wenqiang Zhang (1,3); 1 Academy for Engineering and Technology, Fudan University; 2 Amazon Web Services; 3 School of Computer Science, Fudan University
Pseudocode No The paper describes the methods and framework components (InstFormer, InstCLIP, Universal Rollout Association) textually and through architectural diagrams (Figure 2, Figure 3) but does not provide any explicitly labeled pseudocode or algorithm blocks.
Open Source Code Yes Code https://github.com/PinxueGuo/OpenVIS
Open Datasets Yes Specifically, we evaluate the proposed model on the YouTube-VIS, BURST, LVVIS, and UVO datasets, encompassing a large number of novel categories, to comprehensively assess its diverse capacities. However, the training process only sees data from YouTube-VIS, which comprises only 40 categories.
Dataset Splits Yes Our OpenVIS model is trained only on YouTube-VIS (a widely used VIS dataset comprising 40 categories). This ensures that the categories present in the training data are a small-scale subset of those found in the test data. More discussion and analysis of the evaluation benchmark can be found in the Supplementary.
Hardware Specification Yes The whole training is done on 8 V100 GPUs for 3 hours.
Software Dependencies No The paper mentions several software components, models, and frameworks such as 'COCO-pretrained Mask2Former', 'ViT-B/32 of CLIP', and 'LoRA', but does not provide specific version numbers for any of these software dependencies.
Experiment Setup Yes InstFormer is trained using a two-stage approach, and the CLIP weights are frozen during the entire training. In the first stage, the open-world mask proposal network and InstCLIP (LoRA adapter) are trained for 6k iterations with L_I and an instance segmentation loss. Subsequently, we train the rollout tracker in the second stage, with all other weights frozen, using L_T for an additional 600 iterations.
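The two-stage schedule quoted above can be sketched as a configuration in code. This is a minimal, hypothetical outline for clarity only: the module names, loss symbols, and freezing logic below are assumptions, not the authors' released implementation.

```python
from dataclasses import dataclass

# Hedged sketch of the two-stage InstFormer training schedule described
# in the paper. Module and loss names are illustrative assumptions.

@dataclass
class TrainingStage:
    name: str
    iterations: int
    trainable: set   # modules whose weights are updated in this stage
    losses: tuple    # loss terms applied in this stage

# All components mentioned in the setup; the CLIP backbone itself is
# frozen for the entire training, per the paper.
ALL_MODULES = {
    "mask_proposal_network",   # open-world mask proposal network
    "instclip_lora_adapter",   # LoRA adapter on top of frozen CLIP
    "rollout_tracker",         # trained only in the second stage
    "clip_backbone",           # frozen throughout
}

STAGES = [
    TrainingStage(
        name="stage1_proposals_and_instclip",
        iterations=6_000,
        trainable={"mask_proposal_network", "instclip_lora_adapter"},
        losses=("L_I", "instance_segmentation_loss"),
    ),
    TrainingStage(
        name="stage2_rollout_tracker",
        iterations=600,
        trainable={"rollout_tracker"},
        losses=("L_T",),
    ),
]

def frozen_modules(stage: TrainingStage) -> set:
    """Modules whose weights are NOT updated in the given stage."""
    return ALL_MODULES - stage.trainable

# CLIP weights stay frozen in every stage, as stated in the paper.
assert all("clip_backbone" in frozen_modules(s) for s in STAGES)
```

Expressing the schedule this way makes the freezing pattern explicit: in stage 2 everything except the rollout tracker is frozen, matching the quoted setup.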