GeoPixel: Pixel Grounding Large Multimodal Model in Remote Sensing
Authors: Akashah Shabbir, Mohammed Zumri, Mohammed Bennamoun, Fahad Shahbaz Khan, Salman Khan
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | GeoPixel demonstrates superior performance in pixel-level comprehension, surpassing existing LMMs in both single-target and multi-target segmentation tasks. Our methodological ablation studies validate the effectiveness of each component in the overall architecture. |
| Researcher Affiliation | Academia | ¹Mohamed bin Zayed University of Artificial Intelligence, ²The University of Western Australia, ³Linköping University, ⁴Australian National University. Correspondence to: Akashah Shabbir <EMAIL>, Mohammed Zumri <EMAIL>. |
| Pseudocode | No | The paper describes the architecture and processes, but does not include a dedicated section or figure for pseudocode or a formal algorithm block. |
| Open Source Code | Yes | https://github.com/mbzuai-oryx/GeoPixel |
| Open Datasets | Yes | We create GeoPixelD, a multi-modal grounded conversation generation (GCG) dataset comprising 53,816 grounded phrases linked to 600,817 object masks, specifically tailored for RS image understanding. We utilize the instance-level annotated dataset iSAID (Waqas Zamir et al., 2019) to generate grounded conversations through our annotation pipelines. To address this task, we fine-tune the GeoPixel model on the RRSIS-D (Liu et al., 2024c) dataset. Table 4 compares performance on VRSBench (Li et al., 2024a), a dataset based on DOTA-v2 and DIOR. |
| Dataset Splits | Yes | For the preprocessed iSAID (Waqas Zamir et al., 2019) train dataset (Appendix A), we derive 16,795 holistic, 36,793 instance, and 17,023 group annotations, collectively encompassing 600,817 objects. Following similar procedures, the test-set GCG descriptions (utilizing iSAID validation-set images) undergo meticulous manual curation... To address this task, we fine-tune the GeoPixel model on the RRSIS-D (Liu et al., 2024c) dataset. The resulting GeoPixel-ft model demonstrates superior performance compared to recent approaches, as shown by results on the RRSIS-D test and validation sets in Table 3. Moreover, GeoPixelD and VRSBench use DOTA's training set for training and its validation set for testing. |
| Hardware Specification | Yes | We train GeoPixel on the GeoPixelD dataset for the GCG task on two NVIDIA A6000-48GB GPUs, which takes around 3 days. |
| Software Dependencies | No | The paper mentions specific models like InternLM2-7B, CLIP ViT-L/14, SAM-2, and InternLM-XComposer-2.5, but does not provide specific version numbers for software libraries or environments (e.g., Python, PyTorch, CUDA versions). |
| Experiment Setup | Yes | A fixed CLIP ViT-L vision encoder with a resolution of 560×560 is employed, along with a grounded vision encoder initialized from SAM-2 weights. The trainable components include a pixel decoder (D), LoRA parameters (α = 8), a vision projector Pv, and a language projector Pt. The adaptive image divider sets the maximum patch number P to 9 for training. In our training process, with an effective batch size of 20 over 10 epochs, the learning rate increases linearly to a maximum value of 3×10⁻⁴ over the initial 100 training steps, followed by a gradual cosine decay. |
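The learning-rate schedule quoted above (linear warmup to 3×10⁻⁴ over 100 steps, then cosine decay) can be sketched as a standalone function. This is a minimal illustration, not the authors' code: `total_steps` is a hypothetical value, since the paper reports epochs and batch size but not the total step count.

```python
import math

def lr_at_step(step, max_lr=3e-4, warmup_steps=100, total_steps=5000):
    """Schedule described in the paper's setup: linear warmup to max_lr
    over the first `warmup_steps` steps, then cosine decay to zero.

    NOTE: total_steps=5000 is an assumed placeholder; the paper does not
    state the total number of optimization steps.
    """
    if step < warmup_steps:
        # Linear warmup: ramps from max_lr/warmup_steps up to max_lr.
        return max_lr * (step + 1) / warmup_steps
    # Cosine decay over the remaining steps, clamped at the end of training.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * max_lr * (1.0 + math.cos(math.pi * min(progress, 1.0)))
```

In a PyTorch training loop, an equivalent schedule would typically be built with `torch.optim.lr_scheduler.LambdaLR` wrapping a function like this one (divided by `max_lr` to yield a multiplier).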