Synthetic Datasets for Machine Learning on Spatio-Temporal Graphs using PDEs
Authors: Jost Arndt, Utku Isil, Michael Detzel, Wojciech Samek, Jackie Ma
DMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | More precisely, we showcase three equations to model different types of disasters and hazards in the fields of epidemiology, atmospheric particles, and tsunami waves. Further, we show how such created datasets can be used by benchmarking several machine learning models on the epidemiological dataset. Additionally, we show how pre-training on this dataset can improve model performance on real-world epidemiological data. The presented methods enable others to create datasets and benchmarks customized to individual requirements. |
| Researcher Affiliation | Academia | Jost Arndt¹ EMAIL, Utku Isil¹ EMAIL, Michael Detzel¹ EMAIL, Wojciech Samek¹,²,³ EMAIL, Jackie Ma¹ EMAIL. ¹ Department of Artificial Intelligence, Fraunhofer Heinrich Hertz Institute, Berlin; ² Department of Electrical Engineering and Computer Science, Technische Universität Berlin; ³ BIFOLD Berlin Institute for the Foundations of Learning and Data |
| Pseudocode | No | The paper describes numerical solution methods for PDEs using FEM and iterative schemes (e.g., Crank-Nicolson, Newton method) in paragraph form. While it provides mathematical formulations (Equation 1, 2, 3, 4) and describes the steps, it does not include any clearly structured pseudocode or algorithm blocks with step-by-step instructions in a programmatic format. |
| Open Source Code | Yes | The source code for our methodology and the three created datasets can be found on github.com/Jostarndt/Synthetic_Datasets_for_Temporal_Graphs. [...] The code and data can be found on github.com/Jostarndt/Synthetic_Datasets_for_Temporal_Graphs. The code is published under the GNU LESSER GENERAL PUBLIC LICENSE v2.1. |
| Open Datasets | Yes | The source code for our methodology and the three created datasets can be found on github.com/Jostarndt/Synthetic_Datasets_for_Temporal_Graphs. [...] The created datasets are published under the CC BY 4.0 license. [...] The Brazilian COVID-19 dataset has 27 nodes and 1093 time-steps spanning 2019-2022 after concatenation and linear interpolation for daily resolution and is publicly accessible at sisaps.saude.gov.br/painelsaps/atendimento. [...] German COVID-19 data can be found at github.com/robert-koch-institut/SARS-CoV-2-Infektionen_in_Deutschland, with 1539 time-steps. The German influenza dataset can be found at survstat.rki.de/, with our curation having 5256 time-steps. |
| Dataset Splits | Yes | We will proceed with the epidemiological dataset, which we split 76/12/12 along the time axis into a train, test and validation dataset. The splits are multiples of four because 4% of the dataset is exactly one wave/infectious scenario: we simulated 25 different scenarios and wanted to prevent data leakage across scenarios and to evaluate on full waves. |
| Hardware Specification | Yes | The following computations are executed on two AMD EPYC 7543 32-Core Processors. |
| Software Dependencies | Yes | The implementation of the FEM is done with deal.II (Arndt et al., 2023), which is written in C++ and abstracts the full implementation around the FEM, but this can be done with any other FEM library. [...] The deal.II library, version 9.5. Journal of Numerical Mathematics, 31(3):231-246, 2023. [...] The data was processed with the use of GeoPandas (Jordahl et al., 2020), shapely and SciPy (Virtanen et al., 2020) |
| Experiment Setup | Yes | To benchmark the presented architectures from section 4.1, we define three tasks on the synthetic epidemiological dataset, created from the SI Eq. 2, that are motivated by forecasting tasks with real-world data. [...] Forecasting on clean data: The most straightforward task is a simple forecast of the next n timesteps, based on the last m timesteps of inputs. We set m = n = 14. The input data into the models therefore are 14 consecutive graphs, sharing the same adjacency, or one graph with 14 node features: $(V, E, X_{i,\dots,i+13})$. The targets are simply $X_{i+14,\dots,i+27}$. As a test and training loss we use the RMSE over all samples, nodes in V, and forecasted timesteps m. [...] The employed training epochs for the different tasks and models can be found in Table 5. [...] The exact hyperparameters can be found in the implementation on GitHub in the respective parameter files, for example under /ml/mp_pde/mp_pde.yml for the MP-PDE model. The parameter files for other models can be found in their respective directories. The hyperparameters were chosen as optimal for each model experimentally. We trained all models for each task with the Adam optimizer until convergence (early stopping). |
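
The split and forecasting setup quoted in the table can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the 25 scenarios are stored back to back along the time axis with equal length, that node features form an array of shape `(T, N)`, and that windows are drawn over a whole split. The function names `scenario_split`, `make_windows`, and `rmse` are hypothetical.

```python
import numpy as np


def scenario_split(num_steps, num_scenarios=25, counts=(19, 3, 3)):
    """Return [train, test, val] slices along the time axis.

    Assumption: scenarios are concatenated in time and equally long, so a
    76/12/12 split of 25 scenarios corresponds to 19/3/3 whole scenarios.
    """
    if num_steps % num_scenarios != 0:
        raise ValueError("num_steps must be a multiple of num_scenarios")
    if sum(counts) != num_scenarios:
        raise ValueError("counts must sum to num_scenarios")
    steps_per_scenario = num_steps // num_scenarios
    # Cumulative scenario counts give the boundaries between the splits.
    bounds = np.cumsum((0,) + tuple(counts)) * steps_per_scenario
    return [slice(int(a), int(b)) for a, b in zip(bounds[:-1], bounds[1:])]


def make_windows(x, m=14, n=14):
    """Cut x of shape (T, N) into inputs (S, m, N) and targets (S, n, N)."""
    num_samples = x.shape[0] - m - n + 1
    if num_samples <= 0:
        raise ValueError("time series is shorter than m + n steps")
    # Sample i uses steps i..i+m-1 as input and i+m..i+m+n-1 as target.
    inputs = np.stack([x[i:i + m] for i in range(num_samples)])
    targets = np.stack([x[i + m:i + m + n] for i in range(num_samples)])
    return inputs, targets


def rmse(pred, target):
    """Root mean squared error over samples, timesteps, and nodes."""
    return float(np.sqrt(np.mean((pred - target) ** 2)))


if __name__ == "__main__":
    # Placeholder data: 25 scenarios of 100 steps each on 5 nodes.
    x = np.random.default_rng(0).random((2500, 5))
    train, test, val = scenario_split(x.shape[0])
    inputs, targets = make_windows(x[train])
    # Naive baseline: repeat the last observed step for all n forecast steps.
    baseline = np.repeat(inputs[:, -1:, :], targets.shape[1], axis=1)
    print(inputs.shape, targets.shape, rmse(baseline, targets))
```

Cutting at scenario boundaries is what keeps a single wave from appearing in more than one split, which matches the stated goal of preventing leakage across scenarios and evaluating on full waves.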