Adversarial Robustness of Graph Transformers

Authors: Philipp Foth, Lukas Gosch, Simon Geisler, Leo Schwinn, Stephan Günnemann

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate our attacks on multiple tasks and perturbation models, including structure perturbations for node and graph classification, and node injection for graph classification. Our results reveal that GTs can be catastrophically fragile in many cases. Addressing this vulnerability, we show how our adaptive attacks can be effectively used for adversarial training, substantially improving robustness.
Researcher Affiliation | Academia | Philipp Foth EMAIL, School of Computation, Information and Technology, Technical University of Munich
Pseudocode | Yes | Algorithm 1: Our k-step free adversarial training
Open Source Code | Yes | The code to reproduce our results can be found at https://github.com/isefos/gt_robustness.
Open Datasets | Yes | We first evaluate our structure attacks on CLUSTER (Dwivedi et al., 2023) [...] We also consider the graph classification dataset Reddit Threads (Rozemberczki et al., 2020). [...] we evaluate on the UPFD fake news detection datasets (Dou et al., 2021).
Dataset Splits | Yes | We used the standard PyG train/val/test split of 83.3/8.3/8.3% graphs. The binary graph classification dataset Reddit Threads (Rozemberczki et al., 2020) contains 203 088 graphs with an average of 23.9 nodes. We used a stratified random split of 75/12.5/12.5%. The binary graph classification dataset UPFD gossipcop (Dou et al., 2021) contains 5464 graphs with an average of 58 nodes. We use the standard PyG split of 20/10/70%. The binary graph classification dataset UPFD politifact (Dou et al., 2021) contains 314 graphs with an average of 131 nodes. We use the standard PyG split of 20/10/70%.
Hardware Specification | No | No specific hardware details such as CPU/GPU models or memory specifications are provided in the paper.
Software Dependencies | No | The paper mentions PyTorch 2.7.1 in a theoretical discussion about the ReLU function's differentiability, and PyG in relation to dataset splits, but does not provide a comprehensive list of software dependencies with specific version numbers used for their experimental setup.
Experiment Setup | Yes | To obtain trained models of comparable performance for each architecture type, we performed a hyperparameter search for each model and dataset. [...] The final hyperparameters of the best models used for the robustness results are shown for Graphormer in Tab. 4, for SAN in Tab. 5, for GRIT in Tab. 6, for Polynormer in Tab. 8, for GPS in Tab. 7, for GCN in Tab. 9, for GPS-GCN in Tab. 10, for GAT in Tab. 11, and for GATv2 in Tab. 12.
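The pseudocode entry above refers to the paper's "k-step free adversarial training" (Algorithm 1). For orientation, here is a generic PyTorch sketch of free adversarial training in the style of Shafahi et al. (2019), where the perturbation is carried across k replays of each minibatch so adversarial examples add little gradient cost. This is an illustrative assumption, not a reproduction of the paper's Algorithm 1, and `model`, `loader`, and `optimizer` are placeholders.

```python
import torch

def free_adversarial_training(model, loader, optimizer, k=4, eps=0.1):
    """Generic sketch of 'free' adversarial training (not the paper's exact Algorithm 1)."""
    delta = None  # perturbation reused between replays of a minibatch
    for x, y in loader:
        if delta is None or delta.shape != x.shape:
            delta = torch.zeros_like(x)
        for _ in range(k):  # k replays of the same minibatch
            delta.requires_grad_(True)
            loss = torch.nn.functional.cross_entropy(model(x + delta), y)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()  # model update from this replay
            # reuse the input gradient for an FGSM-style perturbation step
            with torch.no_grad():
                delta = (delta + eps * delta.grad.sign()).clamp(-eps, eps)
```

Each backward pass thus serves double duty: it updates the model parameters and supplies the input gradient for the next perturbation step.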
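The dataset-splits row mentions a stratified random 75/12.5/12.5% split for Reddit Threads. One way such a split could be produced is with scikit-learn; this is an assumption for illustration, since the paper does not specify its splitting code.

```python
from sklearn.model_selection import train_test_split

def stratified_split(indices, labels, seed=0):
    """Hypothetical 75/12.5/12.5 stratified random split (not the paper's code)."""
    # Carve off the 75% training portion, stratified by class label.
    train_idx, rest_idx, _, rest_y = train_test_split(
        indices, labels, train_size=0.75, stratify=labels, random_state=seed)
    # Split the remaining 25% evenly into validation and test (12.5% each).
    val_idx, test_idx = train_test_split(
        rest_idx, train_size=0.5, stratify=rest_y, random_state=seed)
    return train_idx, val_idx, test_idx
```

Stratification keeps the class ratio of the binary labels roughly constant across all three partitions.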