OpenVIS: Open-vocabulary Video Instance Segmentation
Authors: Pinxue Guo, Hao Huang, Peiyang He, Xuefeng Liu, Tianjun Xiao, Wenqiang Zhang
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The experimental results demonstrate that the proposed InstFormer achieves state-of-the-art capabilities on a comprehensive OpenVIS evaluation benchmark, while also achieving competitive performance on the fully supervised VIS task. |
| Researcher Affiliation | Collaboration | Pinxue Guo (1,2*), Hao Huang (2), Peiyang He (2), Xuefeng Liu (2), Tianjun Xiao (2), Wenqiang Zhang (1,3); 1 Academy for Engineering and Technology, Fudan University; 2 Amazon Web Services; 3 School of Computer Science, Fudan University |
| Pseudocode | No | The paper describes the methods and framework components (InstFormer, InstCLIP, Universal Rollout Association) textually and through architectural diagrams (Figure 2, Figure 3) but does not provide any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code: https://github.com/PinxueGuo/OpenVIS |
| Open Datasets | Yes | Specifically, we evaluate the proposed model on the YouTube-VIS, BURST, LVVIS, and UVO datasets, encompassing a large number of novel categories, to comprehensively assess its diverse capacities. However, the training process only sees the data of YouTube-VIS, which comprises only 40 categories. |
| Dataset Splits | Yes | Our OpenVIS model is only trained on YouTube-VIS (a widely-used VIS dataset comprising 40 categories). This ensures that the categories present in the training data are small-scale subsets of those found in the test data. More discussion and analysis of the evaluation benchmark can be found in the Supplementary. |
| Hardware Specification | Yes | The whole training is done on 8 V100 GPUs for 3 hours. |
| Software Dependencies | No | The paper mentions several software components, models, and frameworks such as 'COCO-pretrained Mask2Former', 'ViT-B/32 of CLIP', and 'LoRA', but does not provide specific version numbers for any of these software dependencies. |
| Experiment Setup | Yes | InstFormer is trained using a two-stage approach, and the CLIP weights are frozen during the entire training. In the first stage, the open-world mask proposal network and InstCLIP (LoRA adapter) are trained for 6k iterations with L_I and the instance segmentation loss. Subsequently, we train the rollout tracker in the second stage, with all other weights frozen, using L_T for an additional 600 iterations. |
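The two-stage schedule quoted in the Experiment Setup row can be sketched as a small parameter-freezing helper. This is a hypothetical illustration of the described training recipe; the component names, iteration counts, and the `trainable_params` function are assumptions based only on the quoted text, not the authors' code.

```python
# Hedged sketch of the two-stage InstFormer training schedule:
# stage 1 trains the mask proposal network and the InstCLIP LoRA adapter
# for 6k iterations; stage 2 trains only the rollout tracker for 600
# iterations. CLIP backbone weights stay frozen throughout.

STAGE_ITERS = {1: 6000, 2: 600}  # iteration budgets per stage (from the paper)

# Illustrative component registry; names are placeholders, not real modules.
COMPONENTS = {
    "clip_backbone": "frozen CLIP weights (never trained)",
    "mask_proposal": "open-world mask proposal network",
    "lora_adapter": "InstCLIP LoRA adapter",
    "rollout_tracker": "universal rollout association tracker",
}

def trainable_params(stage: int, components: dict) -> dict:
    """Return the component groups left unfrozen in a given stage.

    Stage 1: mask proposal network + LoRA adapter (CLIP itself frozen).
    Stage 2: rollout tracker only; everything else frozen.
    """
    unfrozen = {"mask_proposal", "lora_adapter"} if stage == 1 else {"rollout_tracker"}
    return {name: desc for name, desc in components.items() if name in unfrozen}

stage1 = trainable_params(1, COMPONENTS)
stage2 = trainable_params(2, COMPONENTS)
```

In a real training loop, stage selection would toggle `requires_grad` on the corresponding parameter groups before each optimizer step; the sketch only captures which groups the paper says are unfrozen when.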