GeoAggregator: An Efficient Transformer Model for Geo-Spatial Tabular Data

Authors: Rui Deng, Ziqi Li, Mingshu Wang

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We benchmark GeoAggregator against spatial statistical models, XGBoost, and several state-of-the-art geospatial deep learning methods using both synthetic and empirical geospatial datasets. The results demonstrate that GeoAggregator achieves the best or second-best performance compared to its competitors on nearly all datasets. GeoAggregator's efficiency is underscored by its reduced model size, making it both scalable and lightweight. Moreover, ablation experiments offer insights into the effectiveness of the Gaussian bias and Cartesian attention mechanism, providing recommendations for further optimizing GeoAggregator's performance.
Researcher Affiliation | Academia | Rui Deng (1), Ziqi Li (2), Mingshu Wang (1)*; (1) School of Geographical and Earth Science, University of Glasgow; (2) Department of Geography, Florida State University. EMAIL, EMAIL
Pseudocode | No | The paper describes the model architecture and components through text and diagrams (e.g., Figure 2) but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' block.
Open Source Code | Yes | Code and data: https://github.com/ruid7181/GeoAggregator
Open Datasets | Yes | To further illustrate the performances, we use three real-world datasets of different sizes: (1) PM25: from (Dai et al. 2022), which includes 1,457 PM2.5 concentration measurements across mainland China, coupled with related environmental factors. It presents a long-range spatial regression problem due to its sparse and uneven data distribution; (2) Housing: as per (Li 2024), this dataset contains housing prices in King County, WA, USA. It represents a small-scale, densely distributed spatial dataset with notable spatial effects; (3) Poverty: sourced from (Kolak et al. 2020), this dataset includes 14 socioeconomic variables that estimate poverty levels across the continental US.
Dataset Splits | Yes | We set the splitting ratio of training-validation-testing to be 7:1:2 for all datasets except for the PM25 dataset, whose splitting ratio is 56:14:30 (Dai et al. 2022).
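The reported 7:1:2 (and 56:14:30 for PM25) splits can be reproduced with a plain shuffled index split. The sketch below is illustrative only; the function name, seed, and use of NumPy are assumptions, not details from the paper or its repository.

```python
import numpy as np

def split_indices(n, ratios=(0.7, 0.1, 0.2), seed=42):
    """Shuffle row indices and cut them into train/val/test by the given ratios."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    n_train = int(n * ratios[0])
    n_val = int(n * ratios[1])
    # Remaining indices form the test set, absorbing any rounding remainder.
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

# 7:1:2 split for the Housing/Poverty-style datasets
train, val, test = split_indices(1000)
# 56:14:30 split for the 1,457-sample PM25 dataset
pm_train, pm_val, pm_test = split_indices(1457, ratios=(0.56, 0.14, 0.30))
```

Putting the remainder in the test slice keeps the three subsets disjoint and exhaustive regardless of rounding.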
Hardware Specification | Yes | We conduct synthetic data experiments on a laptop with 32GB of RAM and real-world data experiments on a Google Colab virtual machine equipped with an NVIDIA P100 GPU with 16GB of GPU memory.
Software Dependencies | No | The paper mentions using the 'Adam optimizer' but does not specify the version numbers of any core software libraries or frameworks (e.g., Python, PyTorch, TensorFlow) used for implementation.
Experiment Setup | Yes | We set the latent dimension d_model to 32 and the total number of heads H2 = 4. We implement three versions of GeoAggregator, with the number of processor modules L = 0, 1, 2, named GeoAggregator-mini (GA-mini), GA-small, and GA-large, respectively. We set l_hidden to be 0, 4, 8; the parameter l_max to be 81, 144, and 256, respectively. The corresponding searching radius in the Context Query operation is estimated on the training datasets. We use the Adam optimizer with a cyclical learning rate scheduler (max learning rate is 5×10⁻³) (Kingma 2014; Popel and Bojar 2018).
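The cyclical learning-rate schedule mentioned above (max learning rate 5×10⁻³) can be sketched with the standard triangular policy. Only the maximum learning rate comes from the paper; the base_lr and step_size values below are illustrative assumptions.

```python
def cyclical_lr(step, base_lr=1e-4, max_lr=5e-3, step_size=200):
    """Triangular cyclical learning rate: ramps linearly from base_lr up to
    max_lr over step_size steps, then back down, and repeats."""
    cycle = step // (2 * step_size)          # which full up/down cycle we are in
    x = abs(step / step_size - 2 * cycle - 1)  # 1 at cycle edges, 0 at the peak
    return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x)

# Example: the schedule peaks at max_lr halfway through each cycle.
lrs = [cyclical_lr(s) for s in range(0, 401, 100)]
```

In a PyTorch training loop this would typically be handled by `torch.optim.lr_scheduler.CyclicLR` wrapped around Adam; the standalone function above just makes the shape of the schedule explicit.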