Galileo: Learning Global & Local Features of Many Remote Sensing Modalities
Authors: Gabriel Tseng, Anthony Fuller, Marlena Reil, Henry Herzog, Patrick Beukema, Favyen Bastani, James R Green, Evan Shelhamer, Hannah Kerner, David Rolnick
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We introduce a highly multimodal transformer to represent many remote sensing modalities... We present a novel self-supervised learning algorithm... Our Galileo is a single generalist model that outperforms SoTA specialist models for satellite images and pixel time series across eleven benchmarks and multiple tasks. We demonstrate Galileo's accuracy on an extensive suite of benchmarks, covering many applications, domains, and RS data types. |
| Researcher Affiliation | Collaboration | 1Mila – Quebec AI Institute 2McGill University 3Allen Institute for AI (Ai2) 4Carleton University 5University of British Columbia 6Vector Institute 7Arizona State University. Correspondence to: Gabriel Tseng <EMAIL>. |
| Pseudocode | No | The paper describes the methodology in detail using prose and mathematical equations in Section 2 "Global, Local, Multimodal Self-Supervision" and Appendix A.1 "The Galileo SSL algorithm", but it does not include a distinct block labeled "Pseudocode" or "Algorithm" with structured, numbered steps. |
| Open Source Code | Yes | The model weights, pretraining code, pretraining data and evaluation code are open sourced at github.com/nasaharvest/galileo. |
| Open Datasets | Yes | We collect a large, global pretraining dataset of 127,155 instances. Figure 3 maps the training points. ... To construct the Galileo dataset, we split the global WorldCover map (Zanaga et al., 2022) into 1000×1000 pixel (10 km × 10 km) tiles. ... We evaluate our model on all Sentinel-2 tasks in GeoBench (Lacoste et al., 2024). These cover single-timestep image classification and segmentation in various applications and geographies. We also test on fine-grained segmentation via the MADOS marine debris dataset (Kikaki et al., 2024), Sentinel-1 image segmentation via Sen1Floods11 (Bonafilia et al., 2020), image time series segmentation via PASTIS (Garnot & Landrieu, 2021), optical pixel time series classification via BreizhCrops (Rußwurm et al., 2019), and multimodal pixel time series classification via CropHarvest (Tseng et al., 2021). |
| Dataset Splits | Yes | For all GeoBench-modified datasets (Lacoste et al., 2024) — m-EuroSAT, m-BigEarthNet, m-So2Sat, m-Brick-Kiln, m-Cashew-Plant and m-SA-Crop-Type — we use the training, validation and test splits shared by GeoBench. In addition, we use the 1%, 5% and 20% partitions shared by GeoBench. ... MADOS (Kikaki et al., 2024): We use the train/val/test splits from MADOS (50%/25%/25%). ... PASTIS (Garnot & Landrieu, 2021): ... we use folds {1, 2, 3} for training, 4 for validation and 5 for testing. ... BreizhCrops (Rußwurm et al., 2019): We use two regions for training (FRH01, with 178,613 parcels, and FRH02, with 140,645 parcels). We use FRH03 (166,391 parcels) for validation and FRH04 (122,614 parcels) for testing. ... CropHarvest (Tseng et al., 2021): ... (i) crop vs. non-crop in Togo, with 1,319 samples in the training set and 306 samples in the test set... |
| Hardware Specification | Yes | All models are trained on single H100 GPUs (model sizes and training times are described in Table 12). We use an effective batch size of 512, which consists of minibatches of 32 instances augmented and repeated 4 times (Hoffer et al., 2019). ... GPU-hours describes the number of GPU-hours required to pretrain each model for 500 epochs on an H100 GPU. |
| Software Dependencies | No | The paper mentions using the "AdamW optimizer" and "scikit-learn" but does not specify version numbers for these or any other software libraries (e.g., Python, PyTorch, TensorFlow) used in the implementation. |
| Experiment Setup | Yes | We use an effective batch size of 512, which consists of minibatches of 32 instances augmented and repeated 4 times (Hoffer et al., 2019). For data augmentations, we randomly apply vertical and horizontal flipping and 90-degree rotations to each instance. ... We use bfloat16 precision, and the AdamW optimizer with β1 = 0.9 and β2 = 0.999 with gradient clipping. We warm up our learning rate for 30 epochs to a maximum learning rate before applying a cooldown via a cosine decay schedule. We use exponential moving averaging (EMA) to update our target encoder with a momentum value of 0.996, which linearly increases to 1 throughout pretraining, following Assran et al. (2022). For all ablations (Section 4.1), we pretrain a ViT-Tiny model for 200 epochs to a maximum learning rate of 2×10⁻³ and use a weight decay of 0.02. For the final Galileo models, we pretrain the models for 500 epochs and conduct a sweep of [learning rate × weight decay]. For the ViT-Nano and ViT-Tiny architectures, we sweep learning rates [1×10⁻³, 2×10⁻³, 3×10⁻³] and weight decays [1×10⁻², 2×10⁻², 3×10⁻²]. For the ViT-Base architecture, we sweep learning rates [1×10⁻⁴, 3×10⁻⁴, 1×10⁻³, 2×10⁻³, 3×10⁻³] and weight decays [1×10⁻², 2×10⁻², 3×10⁻²]. |
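The Experiment Setup row quotes the paper's schedule in prose: a 30-epoch linear learning-rate warmup followed by a cosine cooldown, and an EMA target-encoder momentum ramped linearly from 0.996 to 1. A minimal framework-free sketch of those two schedules is below. The function names (`lr_at_epoch`, `ema_momentum`, `ema_update`) are illustrative, not from the paper's released code, and the cosine decaying to exactly zero is an assumption — the paper says only "cooldown via a cosine decay schedule".

```python
import math

def lr_at_epoch(epoch, max_lr=2e-3, warmup_epochs=30, total_epochs=200):
    """Linear warmup to max_lr over warmup_epochs, then cosine decay.
    Decaying to zero at total_epochs is an assumption; the paper does
    not state the final learning rate."""
    if epoch < warmup_epochs:
        return max_lr * (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return max_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

def ema_momentum(step, total_steps, base=0.996, final=1.0):
    """Target-encoder EMA momentum, linearly increased from base to final
    over the course of pretraining (following Assran et al., 2022)."""
    return base + (final - base) * (step / total_steps)

def ema_update(target_params, online_params, m):
    """One EMA step: target <- m * target + (1 - m) * online, elementwise."""
    return [m * t + (1.0 - m) * o for t, o in zip(target_params, online_params)]
```

With the ablation settings (max learning rate 2×10⁻³, 200 epochs), the rate reaches its peak at the end of warmup and the EMA momentum starts at exactly 0.996 and ends at exactly 1, so the target encoder stops moving at the end of pretraining.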