Robust Spatio-Temporal Centralized Interaction for OOD Learning

Authors: Jiaming Ma, Binwu Wang, Pengkun Wang, Zhengyang Zhou, Xu Wang, Yang Wang

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Compared with 14 baselines across six datasets, STOP achieves up to 17.01% improvement in generalization performance and 18.44% improvement in inductive learning performance. In this section, we conduct a comprehensive evaluation of the proposed model.
Researcher Affiliation | Academia | 1 University of Science and Technology of China (USTC), Hefei, China; 2 Suzhou Institute for Advanced Research, USTC, Suzhou, China. First author email: Jiaming EMAIL. Correspondence to: Binwu Wang and Yang Wang (corresponding authors) <EMAIL and EMAIL>.
Pseudocode | Yes | We provide the pseudocode of the algorithm in Algorithm 1, where we can observe that STOP makes final predictions based on the temporal component and the spatial component. This includes a perturbation process to extract robust knowledge; the perturbation occurs only in the training phase and is no longer used in the test phase. We also provide the optimization flow of GenPU and the model parameters in Algorithm 2.
Open Source Code | Yes | The code is available at https://github.com/PoorOtterBob/STOP.
Open Datasets | Yes | We conduct a comprehensive evaluation of our model on six spatio-temporal datasets spanning multiple years across two domains. These datasets include LargeST (Liu et al., 2024b) and PEMSD3-Stream (Chen et al., 2021) in the traffic domain, and KnowAir (Wang et al., 2020) in the atmospheric domain. The dataset summary is presented in Table 1.
Dataset Splits | Yes | The training set comprises the first 60% of data from the initial year's dataset, while the following 20% of data is used as the validation set. In each subsequent year, the last 20% of data is designated as the test set. This setup aims to accentuate the temporal distribution difference between the test and training sets, while maintaining a ratio of approximately 6:2:2 for the training, validation, and test sets. Regarding structural shift evaluation, we select a subset of nodes for training and validation. In the test set, we randomly mask 10% of nodes to simulate node disappearance and add 30% of nodes as new nodes to simulate shifts in the graph structure and scale.
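The split protocol quoted above can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the function names, array layout (time-major NumPy arrays, integer node IDs), and the use of `numpy.random.Generator` are all assumptions.

```python
import numpy as np

def split_initial_year(data, train_ratio=0.6, val_ratio=0.2):
    """Split the initial-year series along time: first 60% train, next 20% val.

    The remaining 20% (and the last 20% of each later year) serves as test data.
    """
    T = data.shape[0]
    train_end = int(T * train_ratio)
    val_end = int(T * (train_ratio + val_ratio))
    return data[:train_end], data[train_end:val_end]

def structural_shift(node_ids, rng, mask_ratio=0.10, add_ratio=0.30):
    """Simulate structural shift for the test graph.

    Randomly masks 10% of the original nodes (node disappearance) and appends
    30% new node IDs (graph growth), matching the ratios quoted in the report.
    """
    node_ids = np.asarray(node_ids)
    n = len(node_ids)
    kept = rng.choice(node_ids, size=n - int(n * mask_ratio), replace=False)
    first_new = node_ids.max() + 1
    new_nodes = np.arange(first_new, first_new + int(n * add_ratio))
    return np.concatenate([np.sort(kept), new_nodes])
```

For example, a 10-node graph with these ratios yields a 12-node test graph (9 surviving nodes plus 3 new ones).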
Hardware Specification | Yes | We implement all models using the PyTorch framework with Python 3.8.3, leveraging an Nvidia A100-PCIE-40GB GPU. MAE, RMSE, and MAPE are used as metrics for comparison.
Software Dependencies | No | We implement all models using the PyTorch framework with Python 3.8.3, leveraging an Nvidia A100-PCIE-40GB GPU. MAE, RMSE, and MAPE are used as metrics for comparison. Although Python is given a specific version (3.8.3), PyTorch is mentioned only as a 'framework' without a version number, so the requirement of multiple versioned software components is not met.
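The three metrics named above are standard; a minimal NumPy sketch is given below. The `eps` guard in MAPE is an assumption of this sketch (the report does not quote how the paper handles zero targets), and the function names are illustrative.

```python
import numpy as np

def mae(pred, true):
    # Mean Absolute Error: average magnitude of the prediction error.
    return np.mean(np.abs(pred - true))

def rmse(pred, true):
    # Root Mean Squared Error: penalizes large errors more heavily than MAE.
    return np.sqrt(np.mean((pred - true) ** 2))

def mape(pred, true, eps=1e-8):
    # Mean Absolute Percentage Error, in percent; eps guards against
    # division by zero (assumed here; zero targets do occur in traffic data).
    return np.mean(np.abs((pred - true) / (true + eps))) * 100.0
```

For instance, with predictions [2, 4] against targets [1, 5], MAE and RMSE are both 1.0 and MAPE is 60%.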
Experiment Setup | Yes | We set both the input and prediction windows to 12 for traffic prediction and 24 for atmospheric prediction. The temporal decomposition kernel size ξ is 3 for the traffic datasets and 7 for KnowAir. The number of ConAU K is set to {8, 24, 32, 64, 8, 4} and the number of GenPU M to {3, 3, 3, 3, 2, 4} for the six datasets in Table 1. The dimension of the embeddings is set to 64. We use 8 heads in the multi-head low-rank attention.
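The quoted hyperparameters can be collected into a configuration sketch. The key names below are hypothetical, and the per-dataset order of K and M simply follows the paper's Table 1, which is not reproduced in this report, so the lists are index-aligned placeholders rather than a named mapping.

```python
# Hypothetical configuration collecting the reported hyperparameters.
CONFIG = {
    "window": {"traffic": 12, "atmospheric": 24},  # input = prediction horizon
    "kernel_xi": {"traffic": 3, "KnowAir": 7},     # temporal decomposition kernel
    "num_con_au_K": [8, 24, 32, 64, 8, 4],         # one entry per dataset (Table 1 order)
    "num_gen_pu_M": [3, 3, 3, 3, 2, 4],            # one entry per dataset (Table 1 order)
    "embed_dim": 64,                               # embedding dimension
    "attn_heads": 8,                               # multi-head low-rank attention
}
```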