Q-MiniSAM2: A Quantization-based Benchmark for Resource-Efficient Video Segmentation
Authors: Xuanxuan Ren, Xiangyu Li, Kun Wei, Xu Yang, Yanhua Yang
IJCAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 4 Experiments |
| Researcher Affiliation | Academia | Xidian University EMAIL, EMAIL, EMAIL |
| Pseudocode | Yes | Algorithm 1 Post-Training Quantization |
| Open Source Code | No | Explanation: The paper does not explicitly state that its source code is available or provide a link to a repository for the methodology described. |
| Open Datasets | Yes | We conduct experiments on two object segmentation datasets: MS-COCO [Lin et al., 2014] and SA-V [Ravi et al., 2024]. |
| Dataset Splits | Yes | MS-COCO contains 123,000 images across 91 object categories, of which the training set contains 118,000 images and the validation set contains 5,000 images. The SA-V dataset comprises approximately 51,000 real-world videos and over 600,000 spatiotemporal masks (referred to as masklets), establishing it as the largest video segmentation dataset to date. Specifically, the training split consists of 50,583 videos and 642,036 masklets, while the validation split includes 155 videos and 293 masklets. Additionally, the test split contains 150 videos and 278 masklets. |
| Hardware Specification | No | Explanation: The paper mentions 'On specialized hardware' in the context of theoretical speedup, but it does not provide specific details on the GPU/CPU models or other hardware used for running experiments. |
| Software Dependencies | No | Explanation: The paper mentions 'YOLOX [Ge, 2021]' as a detector but does not provide specific version numbers for any software or libraries used. |
| Experiment Setup | Yes | For quantization training, a set of 32 unannotated training images is randomly selected to form the training dataset. In the prompt-based visual segmentation task, to obtain accurate target masks through manually annotated box prompts, 8 videos are randomly chosen from the SA-V validation set, with 20 frames extracted from each video to construct the training dataset. Following conventional methodologies, the implemented quantization strategy includes per-channel asymmetric quantization for weights and per-tensor asymmetric quantization for activation values. Each module undergoes 20,000 iterations during the reconstruction phase. Additionally, to ensure the stability and robustness of the model's performance, the first and last layers (or modules) of the network are exempted from the quantization process. The hyperparameters α, β, and γ are set to 1, 0.5, and 0.4, respectively. |
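The quantization scheme reported above (per-channel asymmetric for weights, per-tensor asymmetric for activations) can be illustrated with a minimal NumPy sketch. This is not the paper's implementation, only the standard asymmetric uniform quantizer the setup refers to; the function names and the `axis` convention are our own.

```python
import numpy as np

def asymmetric_quantize(x, num_bits=8, axis=None):
    """Asymmetric uniform quantization to unsigned integers.

    axis=None -> per-tensor: one (scale, zero_point) for the whole tensor,
                 as used for activations in the paper's setup.
    axis=1    -> per-channel for a 2D weight of shape (out_ch, in_ch):
                 one (scale, zero_point) per output channel.
    """
    qmax = 2 ** num_bits - 1
    x_min = x.min(axis=axis, keepdims=True)
    x_max = x.max(axis=axis, keepdims=True)
    scale = (x_max - x_min) / qmax
    scale = np.where(scale == 0, 1e-8, scale)  # guard constant tensors
    zero_point = np.round(-x_min / scale)      # maps x_min to integer 0
    q = np.clip(np.round(x / scale + zero_point), 0, qmax)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Map integer codes back to the real-valued approximation."""
    return (q - zero_point) * scale
```

Because the zero point shifts the grid so that both `x_min` and `x_max` are representable, the round-trip error of each element is bounded by half a quantization step (`scale / 2`), which is why asymmetric schemes are preferred for non-zero-centered activations.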