Faithful Interpretation for Graph Neural Networks
Authors: Lijie Hu, Tianhao Huang, Lu Yu, Wanyu Lin, Tianhang Zheng, Di Wang
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 5 Experiments. 5.1 Experimental Setup. Datasets. In this study, we employ several datasets, spanning small- to large-scale graphs, to conduct an exhaustive comparison of our approach with the baseline methods. For the node classification task, we utilize Amazon CS and Amazon Photo (McAuley et al., 2015), Coauthor CS and Coauthor Physics (Shchur et al., 2018), and ogbn-arXiv (Wang et al., 2020; Mikolov et al., 2013). For the link prediction task, we adopt Cora, Citeseer and Pubmed (Sen et al., 2008). Finally, for the graph classification task, we make use of the D&D (Dobson & Doig, 2003), MUTAG (Debnath et al., 1991) and Politifact (Dou et al., 2021) datasets. We present the statistics of the selected datasets in Tables 3, 4 and 5. We also provide a detailed description of these datasets in Appendix C. Baselines. We employ three attention-based models, namely Graph Attention Network (GAT) (Veličković et al., 2018), GATv2 (Brody et al., 2021), and Graph Transformer (GT) (Dwivedi & Bresson, 2021), as base models for three downstream tasks: node classification, graph classification and link prediction. Due to space constraints, the results of the link prediction task are presented in Appendix D. Following the work of Zheng et al. (2021), we compare our method with the vanilla models and two general defense techniques: layer normalization (LN) and adversarial training (AT). We refer readers to Appendix G for implementation details. Evaluation Metrics. For assessing the model's performance, we utilize the F1-score as the primary metric. To comprehensively assess the stability of the model when encountering various forms of perturbation and randomness, as well as its ability to maintain the interpretability of attention, we use graph-based Jensen-Shannon Divergence (g-JSD) and Total Variation Distance (g-TVD) as metrics for measuring model stability. Furthermore, we design two additional novel metrics, namely F+ slope and F− slope, to fully evaluate the interpretability of graph attention-based neural networks. 5.2 Stability Evaluation. 5.3 Interpretability Evaluation. 5.4 Visualization Results. 5.5 Ablation Study and Computation Cost |
| Researcher Affiliation | Collaboration | Lijie Hu EMAIL Provable Responsible AI and Data Analytics (PRADA) Lab King Abdullah University of Science and Technology Tianhao Huang EMAIL Provable Responsible AI and Data Analytics (PRADA) Lab Nankai University Lu Yu EMAIL Ant Group Wanyu Lin EMAIL Department of Computing Hong Kong Polytechnic University Tianhang Zheng EMAIL The State Key Laboratory of Blockchain and Data Security Zhejiang University Di Wang EMAIL Provable Responsible AI and Data Analytics (PRADA) Lab King Abdullah University of Science and Technology |
| Pseudocode | Yes | Algorithm 1 FGAI. 1: Input: graph h = {h_1, h_2, ..., h_N}, attention weight w. 2: Initialize faithful attention layer w_0. 3: for t = 1, 2, ..., T do 4: Initialize δ_0, ρ_0. 5: for p = 1, 2, ..., P do 6: Update δ using PGD: 7: δ_p = δ_{p−1} + (γ_{p−1}/|B_p|) Σ_{h∈B_p} ∇D(y(h, w_{t−1}), y(h, w_{t−1} + δ_{p−1})). 8: δ′_p = argmin_{‖δ‖≤R} ‖δ − δ_p‖. 9: end for 10: for q = 1, 2, ..., Q do 11: Update ρ using PGD: 12: ρ_q = ρ_{q−1} − τ_{q−1} ∇L_{k2}(w_{t−1}, w_{t−1} + ρ_{q−1}). 13: ρ′_q = argmin_{‖ρ‖≤R′} ‖ρ − ρ_q‖. 14: end for 15: Update w using stochastic gradient descent: w_t = w_{t−1} − η_t Σ ∇[D(y(h, w_{t−1}), y(h, w)) − λ_1 L_{k1}(w, w_{t−1}) + λ_2 D(y(h, w_{t−1}), y(h, w_{t−1} + δ′_P)) − λ_3 L_{k2}(w_{t−1}, w_{t−1} + ρ′_Q)]. 16: end for 17: Return: w = w_T. |
| Open Source Code | No | The paper does not provide an explicit statement about releasing source code, nor a link to a code repository. |
| Open Datasets | Yes | Datasets. In this study, we employ several datasets, spanning small- to large-scale graphs, to conduct an exhaustive comparison of our approach with the baseline methods. For the node classification task, we utilize Amazon CS and Amazon Photo (McAuley et al., 2015), Coauthor CS and Coauthor Physics (Shchur et al., 2018), and ogbn-arXiv (Wang et al., 2020; Mikolov et al., 2013). For the link prediction task, we adopt Cora, Citeseer and Pubmed (Sen et al., 2008). Finally, for the graph classification task, we make use of the D&D (Dobson & Doig, 2003), MUTAG (Debnath et al., 1991) and Politifact (Dou et al., 2021) datasets. We present the statistics of the selected datasets in Tables 3, 4 and 5. We also provide a detailed description of these datasets in Appendix C. |
| Dataset Splits | Yes | Table 3: Statistics of the datasets used for the node classification task. Amazon-Photo Amazon-CS Coauthor-CS Coauthor-Physics ogbn-ar Xiv #Nodes 7,650 13,752 18,333 34,493 169,343 #Edges 119,043 245,778 81,894 247,962 1,166,243 #Node Features 745 767 6,805 8,415 128 #Classes 8 10 15 5 40 #Training Nodes 765 1,375 1,833 3,449 90,941 #Validation Nodes 765 1,375 1,833 3,449 29,799 #Test Nodes 6,120 11,002 14,667 27,595 48,603 Table 4: Statistics of the datasets used for graph classification task. D&D MUTAG Politifact #Graphs 1,178 188 314 #Classes 2 3 2 #Avg. Nodes 284.32 17.93 130.74 #Avg. Edges 715.66 19.79 129.75 #Training Graphs 706 112 188 #Validation Graphs 118 18 31 #Test Graphs 354 58 95 Table 5: Statistics of the datasets used for the link prediction task. Cora Pubmed Citeseer #Nodes 2,708 19,717 3,327 #Edges 9,502 79,784 8,194 #Node Features 1,433 500 3,703 #Training Links 4,488 37,676 3,870 #Validation Links 526 4,432 454 #Test Links 1,054 8,864 910 |
| Hardware Specification | Yes | We test the time overhead and GPU memory usage of some popular graph defense methods, namely Pro-GNN (Jin et al., 2020), GNN-SVD (Entezari et al., 2020) and GNNGuard (Zhang & Zitnik, 2020). The results are presented in Table 10. We observe that, compared to these methods, FGAI is a more efficient approach. To our knowledge, a primary reason for this discrepancy is that defense methods like GNN-SVD and GNNGuard require calculations on the dense form of the adjacency matrix of the graph, which significantly limits their applicability on large-scale datasets. On the ogbn-arXiv dataset (169,343 nodes, 1,166,243 edges), all these methods exceed the GPU memory limit (32 GiB). In contrast, our method incurs relatively minimal additional overhead compared to the vanilla model, offering an efficient and cost-saving solution for the graph defense community. |
| Software Dependencies | No | The paper mentions 'Adam as the optimizer' but does not specify version numbers for it or for any other software libraries used; no software is named with a specific version. |
| Experiment Setup | Yes | Regarding the architecture of the vanilla model, for the ogbn-arXiv dataset we configure GAT, GATv2 and GT with three layers and 8 attention heads; the hidden layer dimension is set to 128. On the other datasets, we use one layer with an 8-dimensional hidden layer and 8 attention heads for GAT and GATv2, and a 128-dimensional hidden layer with 8 attention heads for GT. A consistent feature dropout rate and attention dropout rate of 0.6 are applied to all datasets. We use Adam as the optimizer. The learning rate is fixed at 0.01, and the L2 regularization is set to 0.0005. During the training of FGAI, we maintain the same parameters as the vanilla model. For the Coauthor-CS and Coauthor-Physics datasets, we set λ1, λ2, and λ3 to 1, 1, and 1, with K values of 100,000 and 200,000, respectively. For the Amazon-CS and Amazon-Photo datasets, λ1, λ2, and λ3 are set to 0.8, 1, and 1, and K is set to 100,000. On the ogbn-arXiv dataset, we use λ1, λ2, and λ3 values of 1, 10, and 5, with K set to 600,000. |
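The stability metrics quoted in the Research Type row (g-TVD and g-JSD) compare attention distributions before and after a perturbation. Below is a minimal pure-Python sketch of the underlying distance computations; the paper's "g-" prefix denotes aggregation over a whole graph, which is omitted here, and the function names are ours, not the authors':

```python
import math

def total_variation(p, q):
    """TVD between two discrete distributions: 0.5 * sum |p_i - q_i|."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

def kl_divergence(p, q):
    """KL(p || q); terms with p_i = 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jensen_shannon(p, q):
    """JSD(p, q) = 0.5*KL(p||m) + 0.5*KL(q||m), with m the midpoint mixture."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Attention over four neighbours, before and after a small perturbation.
p = [0.40, 0.30, 0.20, 0.10]
q = [0.35, 0.35, 0.20, 0.10]
print(total_variation(p, q))  # ~0.05
print(jensen_shannon(p, q))   # small but nonzero
```

Both distances are bounded (TVD by 1, JSD by ln 2 in nats), which is what makes them usable as comparable stability scores across datasets.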
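The inner loops of Algorithm 1 take a gradient step on a perturbation and then project it back onto a norm ball (the argmin steps on lines 8 and 13). The following toy sketch shows that projected-gradient step in isolation, assuming an L2 ball of radius R; the helper names and the two-dimensional example are illustrative, not from the authors' code:

```python
import math

def project_l2(delta, radius):
    # argmin_{||d|| <= R} ||d - delta||: rescale onto the ball if outside it.
    norm = math.sqrt(sum(d * d for d in delta))
    if norm <= radius:
        return list(delta)
    return [d * radius / norm for d in delta]

def pgd_step(delta, grad, step, radius):
    # One inner-loop iteration: ascend along the gradient, then project.
    moved = [d + step * g for d, g in zip(delta, grad)]
    return project_l2(moved, radius)

delta = [0.0, 0.0]
delta = pgd_step(delta, grad=[3.0, 4.0], step=1.0, radius=1.0)
print(delta)  # [0.6, 0.8] -- the step lands at (3, 4), norm 5, rescaled onto the unit ball
```

In FGAI the gradient would come from the divergence D (for δ) or the top-K attention loss L_k2 (for ρ) over a mini-batch, but the update-then-project pattern is the same.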
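For reproduction, the per-dataset hyperparameters scattered through the Experiment Setup cell can be collected into one lookup table. This is purely our transcription of the values quoted above; the dict layout and key names are our own convention, not an artifact from the paper:

```python
# FGAI hyperparameters as reported in the setup text: (lambda1, lambda2, lambda3) and K.
FGAI_HPARAMS = {
    "Coauthor-CS":      {"lambdas": (1.0, 1.0, 1.0),  "K": 100_000},
    "Coauthor-Physics": {"lambdas": (1.0, 1.0, 1.0),  "K": 200_000},
    "Amazon-CS":        {"lambdas": (0.8, 1.0, 1.0),  "K": 100_000},
    "Amazon-Photo":     {"lambdas": (0.8, 1.0, 1.0),  "K": 100_000},
    "ogbn-arXiv":       {"lambdas": (1.0, 10.0, 5.0), "K": 600_000},
}

# Shared settings reported for all datasets.
OPTIMIZER = {"name": "Adam", "lr": 0.01, "weight_decay": 5e-4, "dropout": 0.6}
```

Keeping these in a single structure makes the "same parameters as the vanilla model" claim easy to check mechanically when rerunning the experiments.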