ZoRI: Towards Discriminative Zero-Shot Remote Sensing Instance Segmentation

Authors: Shiqi Huang, Shuting He, Bihan Wen

AAAI 2025

Reproducibility Variable Result LLM Response
Research Type | Experimental | "We establish new experimental protocols and benchmarks, and extensive experiments convincingly demonstrate that ZoRI achieves the state-of-the-art performance on the zero-shot remote sensing instance segmentation task." "We establish two remote sensing zero-shot instance segmentation benchmarks with iSAID (Zamir et al. 2019) and NWPU-VHR-10 (Cheng et al. 2014; Su et al. 2019) datasets."
Researcher Affiliation | Academia | "Shiqi Huang¹*, Shuting He²*, Bihan Wen¹; ¹Nanyang Technological University, ²Shanghai University of Finance and Economics. EMAIL, EMAIL, EMAIL"
Pseudocode | No | The paper describes methods and formulations in prose and mathematical equations but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Code: https://github.com/HuangShiqi128/ZoRI
Open Datasets | Yes | "We establish two remote sensing zero-shot instance segmentation benchmarks with iSAID (Zamir et al. 2019) and NWPU-VHR-10 (Cheng et al. 2014; Su et al. 2019) datasets."
Dataset Splits | No | "iSAID dataset is divided into 11 seen classes and 4 unseen classes (tennis court, helicopter, swimming pool and soccer ball field), which has the same seen/unseen split as DOTA (Zang et al. 2024; Xia et al. 2018), and the NWPU-VHR-10 dataset is split into 7 seen classes and 3 unseen classes (ship, basketball court and harbor). For the training set, only images containing seen class objects are selected, while any images with unseen classes are excluded to avoid information leakage. Dataset details can be found in the supplementary material (Huang, He, and Wen 2024)." The paper specifies class splits and data exclusion criteria but does not provide specific train/validation/test image splits within the main text.
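The leakage-avoidance rule quoted above (drop any training image that contains an unseen-class object) can be sketched as a simple filter. This is a hypothetical illustration, not the authors' code; the class names and the image-id-to-classes mapping are assumptions.

```python
# Hypothetical sketch of the seen/unseen training-set filter described in the paper.
# The annotation format (image id -> set of class names) is an assumption.

ISAID_UNSEEN = {"tennis_court", "helicopter", "swimming_pool", "soccer_ball_field"}

def build_training_set(annotations):
    """Keep only images whose objects all belong to seen classes.

    Any image containing even one unseen-class instance is dropped
    entirely, so no unseen-class information leaks into training.
    """
    return [
        image_id
        for image_id, classes in annotations.items()
        if not (classes & ISAID_UNSEEN)  # no overlap with unseen classes
    ]

# Example: the first image has only seen classes and is kept;
# the second contains a helicopter (unseen) and is excluded.
anns = {
    "img_001": {"plane", "ship"},
    "img_002": {"plane", "helicopter"},
}
print(build_training_set(anns))  # -> ["img_001"]
```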
Hardware Specification | Yes | "All experiments are conducted with one RTX A5000 GPU."
Software Dependencies | No | "The proposed method is developed based on FC-CLIP (Yu et al. 2023). We use the LAION-2B pretrained ConvNeXt-Large (Liu et al. 2022) from OpenCLIP (Ilharco et al. 2021) as the feature extractor. The mask generator follows Mask2Former (Cheng et al. 2022) with object query number set to 300. Prompt templates for RESISC45 (Cheng, Han, and Lu 2017) used in CLIP (Radford et al. 2021) are employed to obtain text embeddings with the pretrained CLIP text encoder." The paper mentions several software frameworks and models used, but it does not specify explicit version numbers for these dependencies (e.g., PyTorch version, Python version).
Experiment Setup | Yes | "We train the model for 50 epochs with training batch size 2. Input images are resized to 512×512 during training. Hyperparameters λ and α are set to 0.7 and 0.5, respectively. Instance number T and the number of trainable channels in KMA are empirically set to 1 and 32. The model is optimized using the AdamW optimizer. The learning rate is set to 1.25 × 10⁻⁵."
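The reported hyperparameters can be collected into a single configuration object for reproduction. The field names below are assumptions made for illustration; only the values come from the paper's experiment setup.

```python
# Hypothetical config sketch; field names are assumptions, values are
# taken from the paper's reported experiment setup.
from dataclasses import dataclass

@dataclass(frozen=True)
class ZoRITrainConfig:
    epochs: int = 50                      # training epochs
    batch_size: int = 2                   # training batch size
    image_size: tuple = (512, 512)        # input resize during training
    lambda_weight: float = 0.7            # hyperparameter λ
    alpha: float = 0.5                    # hyperparameter α
    instance_number_T: int = 1            # instance number T
    kma_trainable_channels: int = 32      # trainable channels in KMA
    optimizer: str = "AdamW"
    learning_rate: float = 1.25e-5
    object_queries: int = 300             # Mask2Former object query count

cfg = ZoRITrainConfig()
print(cfg.learning_rate)  # -> 1.25e-05
```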