Gridded Transformer Neural Processes for Spatio-Temporal Data

Authors: Matthew Ashman, Cristiana Diaconu, Eric Langezaal, Adrian Weller, Richard E Turner

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our method consistently outperforms a range of strong baselines in various synthetic and real-world regression tasks involving large-scale data, while maintaining competitive computational efficiency. Experiments with weather data highlight the potential of gridded TNPs and serve as just one example of a domain where they can have a significant impact.
Researcher Affiliation | Academia | ¹University of Cambridge, ²University of Amsterdam, ³Alan Turing Institute. Correspondence to: Matthew Ashman <EMAIL>, Cristiana Diaconu <EMAIL>.
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks within its content. It refers to external pseudocode in Appendix B.1: "Pseudo-code for a forward pass through these models is provided in Algorithms 3 and 4 in Ashman et al. (2024b)."
Open Source Code | Yes | We also provide a public implementation of gridded TNPs in the repository https://github.com/cambridge-mlg/gridded-tnp.
Open Datasets | Yes | We perform two experiments on data from the ERA5 reanalysis by the European Centre for Medium-Range Weather Forecasts (ECMWF; Hersbach et al. 2020). We construct each context dataset by combining the t2m at a random subset of 9,957 weather station locations ... extracted from the HadISD dataset (Dunn et al., 2012). To illustrate the generality of our framework, we include an additional study in Appendix G.6 using the large-scale EAGLE fluid-dynamics dataset (Janny et al., 2023).
Dataset Splits | Yes | We train on data between 2009-2017, validate on 2018 and test on 2019.
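The reported year-based split (train 2009-2017, validate 2018, test 2019) can be sketched as a simple filter over sample timestamps. This is an illustrative helper, not code from the paper's repository; the function name and the list-of-datetimes input format are assumptions.

```python
def split_by_year(timestamps):
    """Partition samples into train/val/test by year, following the
    reported 2009-2017 / 2018 / 2019 scheme. `timestamps` is any
    iterable of datetime objects; returns three lists."""
    train = [t for t in timestamps if 2009 <= t.year <= 2017]
    val = [t for t in timestamps if t.year == 2018]
    test = [t for t in timestamps if t.year == 2019]
    return train, val, test
```

In practice the same year-based filter would be applied to the indices of the ERA5/HadISD samples rather than to raw datetimes, but the partitioning logic is identical.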
Hardware Specification | Yes | Training and inference for all models is performed on one NVIDIA GeForce RTX 2080 Ti. Training and inference are performed using a single NVIDIA A100 80GB with 32 CPU cores.
Software Dependencies | No | The paper mentions software like GPyTorch, the AdamW optimizer, and a U-Net architecture, but does not provide specific version numbers for any of these or other key software components, which is required for reproducibility.
Experiment Setup | Yes | For all experiments and all models, we use the AdamW optimiser (Loshchilov & Hutter, 2019) with a fixed learning rate of 5 × 10⁻⁴ and apply gradient clipping to gradients with magnitude greater than 0.5. In all experiments, we use C = 128, a kernel size of five or nine, and a stride of one. All MHSA / MHCA operations use H = 8 heads, each with D_V = 16 dimensions. We use D_z = D_QK = 128 throughout. We train all models for 500,000 iterations on 160,000 pre-generated datasets using a batch size of eight.
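The reported optimisation setup (AdamW at a fixed learning rate of 5e-4 with gradient clipping at magnitude 0.5) can be sketched in PyTorch as below. The tiny linear model is a stand-in only; the paper's gridded TNP architecture (C = 128, H = 8 heads, D_V = 16, D_z = D_QK = 128) is not reproduced here, and whether the paper clips by norm or by value is an assumption on our part.

```python
import torch

# Stand-in model; the actual gridded TNP architecture is not shown here.
model = torch.nn.Linear(4, 1)

# AdamW with the fixed learning rate reported in the paper.
optimiser = torch.optim.AdamW(model.parameters(), lr=5e-4)

def training_step(x, y):
    """One optimisation step with the reported clipping threshold of 0.5
    (assumed here to be norm-based clipping)."""
    optimiser.zero_grad()
    loss = torch.nn.functional.mse_loss(model(x), y)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.5)
    optimiser.step()
    return loss.item()
```

The paper trains for 500,000 such iterations over 160,000 pre-generated datasets with a batch size of eight.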