GeoPixel: Pixel Grounding Large Multimodal Model in Remote Sensing
Authors: Akashah Shabbir, Mohammed Zumri, Mohammed Bennamoun, Fahad Shahbaz Khan, Salman Khan
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | GeoPixel demonstrates superior performance in pixel-level comprehension, surpassing existing LMMs in both single-target and multi-target segmentation tasks. Our methodological ablation studies validate the effectiveness of each component in the overall architecture. |
| Researcher Affiliation | Academia | ¹Mohamed bin Zayed University of Artificial Intelligence, ²The University of Western Australia, ³Linköping University, ⁴Australian National University. Correspondence to: Akashah Shabbir <EMAIL>, Mohammed Zumri <EMAIL>. |
| Pseudocode | No | The paper describes the architecture and processes, but does not include a dedicated section or figure for pseudocode or a formal algorithm block. |
| Open Source Code | Yes | https://github.com/mbzuai-oryx/GeoPixel |
| Open Datasets | Yes | We create GeoPixelD, a multi-modal grounded conversation generation (GCG) dataset comprising 53,816 grounded phrases linked to 600,817 object masks, specifically tailored for RS image understanding. We utilize the instance-level annotated dataset iSAID (Waqas Zamir et al., 2019) to generate grounded conversations through our annotation pipelines. To address this task, we fine-tune the GeoPixel model on the RRSIS-D (Liu et al., 2024c) dataset. Table 4 compares performance on VRSBench (Li et al., 2024a), a dataset based on DOTA-v2 and DIOR. |
| Dataset Splits | Yes | For the preprocessed iSAID (Waqas Zamir et al., 2019) train dataset (Appendix A), we derive 16,795 holistic, 36,793 instance, and 17,023 group annotations, collectively encompassing 600,817 objects. Following similar procedures, the test-set GCG descriptions (utilizing iSAID validation-set images) undergo meticulous manual curation... To address this task, we fine-tune the GeoPixel model on the RRSIS-D (Liu et al., 2024c) dataset. The resulting GeoPixel-ft model demonstrates superior performance compared to recent approaches, as shown by results on the RRSIS-D test and validation sets in Table 3. Moreover, GeoPixelD and VRSBench use DOTA's training set for training and its validation set for testing. |
| Hardware Specification | Yes | We train GeoPixel on the GeoPixelD dataset for the GCG task on two NVIDIA A6000-48GB GPUs, which takes around 3 days. |
| Software Dependencies | No | The paper mentions specific models like InternLM2-7B, CLIP ViT-L/14, SAM-2, and InternLM-XComposer-2.5, but does not provide specific version numbers for software libraries or environments (e.g., Python, PyTorch, CUDA versions). |
| Experiment Setup | Yes | A fixed CLIP ViT-L vision encoder with a resolution of 560×560 is employed, along with a grounded vision encoder initialized from SAM-2 weights. The trainable components include a pixel decoder (D), LoRA parameters (α = 8), a vision projector Pv, and a language projector Pt. The adaptive image divider sets the maximum patch number P to 9 for training. In our training process, with an effective batch size of 20 over 10 epochs, the learning rate increases linearly to a maximum value of 3×10⁻⁴ over the initial 100 training steps, followed by a gradual cosine decay. |
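The learning-rate schedule quoted above (linear warmup to 3×10⁻⁴ over 100 steps, then cosine decay) can be sketched as a standalone function. This is a minimal illustration, not the authors' code: `total_steps` is a hypothetical value, since the paper reports epochs and batch size but not the total step count.

```python
import math

def lr_at_step(step, max_lr=3e-4, warmup_steps=100, total_steps=5000):
    """Schedule described in the paper's setup: linear warmup to max_lr
    over the first `warmup_steps` steps, then cosine decay to zero.

    NOTE: total_steps=5000 is an assumed placeholder; the paper does not
    state the total number of optimization steps.
    """
    if step < warmup_steps:
        # Linear warmup: ramps from max_lr/warmup_steps up to max_lr.
        return max_lr * (step + 1) / warmup_steps
    # Cosine decay over the remaining steps, clamped at the end of training.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * max_lr * (1.0 + math.cos(math.pi * min(progress, 1.0)))
```

In a PyTorch training loop, an equivalent schedule would typically be built with `torch.optim.lr_scheduler.LambdaLR` wrapping a function like this one (divided by `max_lr` to yield a multiplier).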