Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty, so scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

TEOChat: A Large Vision-Language Assistant for Temporal Earth Observation Data

Authors: Jeremy Irvin, Emily Liu, Joyce Chen, Ines Dormoy, Jinyoung Kim, Samar Khanna, Zhuo Zheng, Stefano Ermon

ICLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show that TEOChat can perform a wide variety of spatial and temporal reasoning tasks, substantially outperforming previous vision and language assistants, and even achieving comparable or better performance than several specialist models trained to perform specific tasks. Furthermore, TEOChat achieves impressive zero-shot performance on a change detection and change question answering dataset, outperforms GPT-4o and Gemini 1.5 Pro on multiple temporal tasks, and exhibits stronger single image capabilities than a comparable single image instruction-following model on scene classification, visual question answering, and captioning.
Researcher Affiliation | Academia | Jeremy Andrew Irvin*, Emily Ruoyu Liu, Joyce Chuyi Chen, Ines Dormoy, Jinyoung Kim, Samar Khanna, Zhuo Zheng, Stefano Ermon (Stanford University). *Correspondence to: EMAIL
Pseudocode | No | The paper describes methods and processes in text but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | We publicly release our data, model, and code at https://github.com/ermongroup/TEOChat. ... Training data, model weights, training code, and evaluation code are hosted at https://github.com/ermongroup/TEOChat. All major experimental results can be reproduced by following the steps described there.
Open Datasets | Yes | To train TEOChat, we curate an instruction-following dataset composed of many single image and temporal tasks including building change and damage assessment, semantic change detection, and temporal scene classification. ... To train TEOChat, we curate TEOChatlas, the first instruction-tuning dataset with instruction-following examples for temporal EO data. We construct a variety of tasks which require spatial and temporal reasoning capabilities using four EO datasets, namely fMoW (Christie et al., 2018), xBD (Gupta et al., 2019), S2Looking (Shen et al., 2021), and QFabric (Verma et al., 2021).
Dataset Splits | Yes | TEOChat achieves impressive performance on TSC with the fMoW RGB (75.1%) and Sentinel (45.5%) validation sets, outperforming both Video-LLaVA (16.6%, 4.9%) and GeoChat (59.2%, 26.3%) (Table 1). ... We include the xBD (Gupta et al., 2019) training dataset in TEOChatlas. ... We report model performance on the test set.
Hardware Specification | Yes | These strategies allow us to train the large multimodal model on temporal sequences of up to 8 images using an NVIDIA A4000 GPU (16GB VRAM). We provide additional training details in Appendix Section F. ... we train the network on 10 GPUs using data parallelism together with optimizer state partitioning, gradient partitioning, parameter partitioning, and CPU offloading with DeepSpeed ZeRO-3 Offload (Rasley et al., 2020).
Software Dependencies | No | The paper mentions using Low-Rank Adaptation (LoRA), AdamW, and DeepSpeed ZeRO-3 Offload, but does not provide specific version numbers for these or other software libraries.
Experiment Setup | Yes | We fine-tune the LLM in TEOChat using LoRA rank 128. Before inputting the images into the image encoder, we resize the shorter dimension of each image to 224 pixels, and then apply a center crop to obtain a 224x224 image. These design decisions allow us to train the model with sequences of up to 8 images on a single NVIDIA A4000 GPU (16GB of VRAM) with a batch size of 1. To increase the effective batch size, we use 8 steps of gradient accumulation and train the network on 10 GPUs using data parallelism together with optimizer state partitioning, gradient partitioning, parameter partitioning, and CPU offloading with DeepSpeed ZeRO-3 Offload (Rasley et al., 2020). We optimize the network using AdamW (Loshchilov & Hutter, 2017) with a cosine learning rate scheduler, a peak learning rate of 2e-5, and a warmup of 3% of the epoch. The model takes 125 hours to train per epoch and we train for 2 epochs.
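The hardware and experiment-setup rows quote a DeepSpeed ZeRO-3 Offload configuration (per-GPU batch size 1, 8 gradient-accumulation steps, partitioned optimizer state, gradients, and parameters, CPU offloading). The paper does not publish its actual config file, but under those quoted settings a minimal sketch of the corresponding DeepSpeed config dictionary might look like:

```python
# Hypothetical DeepSpeed config sketch matching the settings quoted from the
# paper; all values beyond those quoted (batch size 1, 8 accumulation steps,
# ZeRO stage 3 with CPU offload) are assumptions, not the authors' config.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,   # batch size of 1 per GPU
    "gradient_accumulation_steps": 8,      # 8 steps of gradient accumulation
    "zero_optimization": {
        "stage": 3,  # ZeRO-3: partition optimizer state, gradients, and parameters
        "offload_optimizer": {"device": "cpu"},  # CPU offloading of optimizer state
        "offload_param": {"device": "cpu"},      # CPU offloading of parameters
    },
}

# Across 10 data-parallel GPUs this yields an effective batch size of
# 1 (per GPU) x 8 (accumulation) x 10 (GPUs) = 80 sequences per update.
effective_batch_size = 1 * 8 * 10
```

Such a dictionary would typically be passed to `deepspeed.initialize(..., config=ds_config)`; the exact launcher invocation is not specified in the quoted text.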
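The experiment-setup row also describes the optimizer schedule: AdamW with a cosine learning-rate scheduler, a peak learning rate of 2e-5, and warmup over 3% of the epoch. The paper does not give the schedule's closed form, but a common reading (linear warmup to the peak, then cosine decay to zero) can be sketched as:

```python
import math

def lr_at_step(step, total_steps, peak_lr=2e-5, warmup_frac=0.03):
    """Learning rate at a given optimizer step, assuming linear warmup over
    the first `warmup_frac` of steps followed by cosine decay to zero.
    This is a common interpretation, not the authors' verified schedule."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        # Linear warmup from ~0 up to the peak learning rate.
        return peak_lr * (step + 1) / warmup_steps
    # Cosine decay from the peak down to zero over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

For example, with 1000 steps per epoch, the rate ramps up over the first 30 steps, reaches 2e-5 at the end of warmup, and decays toward zero by the final step.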