reproducibilityindex.ai

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Reconsidering Faithfulness in Regular, Self-Explainable and Domain Invariant GNNs

Authors: Steve Azzolin, Antonio Longa, Stefano Teso, Andrea Passerini

ICLR 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We report in Table 2 the Suf values and ranking of explanations produced by three popular modular GNNs (see Table 4) on the Motif2 (Gui et al., 2023) dataset. This comes with in-distribution (ID) and out-of-distribution (OOD) splits, allowing us to sample perturbations from different graph distributions. We consider different distributions p_R, as follows: p_R^{id1} and p_R^{ood1} allow i) replacing the complement C_A = G_A \ R_A of the input graph with that of another sample G'_A = C'_A ∪ R'_A taken from the same split, and ii) removing random edges from C_A. p_R^{id2} and p_R^{ood2} only subsample the complement of each graph by randomly removing a fixed budget of edges.
Researcher Affiliation	Academia	University of Trento, {name.surname}@unitn.it
Pseudocode	Yes	Algorithm 1 Testing for non strict sufficiency
Open Source Code	Yes	The code is publicly available on GitHub (https://github.com/steveazzolin/reconsidering-faithfulness-in-gnns).
Open Datasets	Yes	BaMS (Azzolin et al., 2022) is a synthetic dataset consisting of 1,000 Barabasi-Albert (BA) graphs... Motif2-Basis (Gui et al., 2023) is a synthetic dataset comprising 24,000 graphs... BBBP (Wu et al., 2018) is a dataset derived from a study on modeling and predicting barrier permeability (Martins et al., 2012)... CMNIST-Color (Gui et al., 2022) contains 70,000 graphs... LBAPcore-Assay (Gui et al., 2023) is a molecular dataset consisting of 34,179 graphs... SST2-Length is a sentiment analysis dataset based on the NLP task of sentiment analysis, adapted from the work of Yuan et al. (Yuan et al., 2022).
Dataset Splits	Yes	Motif2-Basis (Gui et al., 2023) is a synthetic dataset comprising 24,000 graphs... The dataset is divided into training (18,000 graphs), validation (3,000 graphs), and test sets (3,000 graphs). In the context of OOD analysis, two additional sets are considered: the OOD validation set and the OOD test set... Motif-Size (Gui et al., 2022) is a synthetic dataset consisting of 24,000 graphs... The dataset is divided into training (18,000 graphs), validation (3,000 graphs), and test sets (3,000 graphs)... CMNIST-Color (Gui et al., 2022) contains 70,000 graphs... In the training set, which contains 50,000 graphs, the digits are colored using five different colors. To evaluate the model's performance on out-of-distribution data, the validation and testing sets each contain 10,000 graphs... SST2-Length is a sentiment analysis dataset... The dataset comprises 70,042 graphs, divided into training, validation, and test sets. The out-of-distribution (OOD) validation and test sets are specifically created to evaluate performance on data with longer sentence lengths.
Hardware Specification	No	The original implementation of the TopK operator exhibits instabilities when used on GPU, in particular in the presence of equal scores, for which either the first or the last elements of the tensor are returned.
Software Dependencies	No	CIGA, LECI and GSAT are developed based on the repository from Gui et al. (2023), using commit fb39550453b4160527f0dcf11da63de43a276ad5... we average the edge scores via torch_geometric.to_undirected.
Experiment Setup	Yes	To encourage reproducibility, we stick to the hyperparameters provided in each respective repository, except for GSAT on BBBP and BaMS where we set the values of ood_param to 0.5 and extra_param to [True, 10, 0.2], and for Motif2-Basis and Motif-Size in Table 5 where we set the values of ood_param to 10 and extra_param to [True, 10, 0.2]. Model selection was performed on the ID validation set. Specifically, we chose the budget b as a fixed proportion of the average number of undirected edges for each split of the dataset, where the proportion ratio is set to 5% in all our experiments. For each explanation, we sample a number q1 of perturbed graphs, where q1 is fixed at 8 in our experiments... Faith corresponds to the harmonic mean of the normalized Nec and normalized Suf scores, where we set d as the L1 divergence for both metrics, and it is computed over a subset of q2 of input graphs, which is set to 800 in our experiments. Since most modular GNNs output soft edge scores, we extract the relevant subgraph via TopK selection, where the size ratios vary in {0.3, 0.6, 0.9}.
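
The quantities quoted in the Experiment Setup row can be illustrated with a minimal Python sketch. This is not the authors' implementation. All function names are hypothetical, and the rounding of the budget, the ceiling used for the TopK size, and the index-based tie-breaking are assumptions made for illustration only. The sketch uses only the standard library.

```python
import math
import random


def edge_budget(avg_undirected_edges, ratio=0.05):
    """Budget b as a fixed proportion (5% in the quoted setup) of the
    average number of undirected edges. Rounding is an assumption."""
    return round(ratio * avg_undirected_edges)


def remove_random_edges(complement_edges, budget, rng):
    """Perturb a graph by dropping up to `budget` edges, chosen uniformly
    at random, from the complement of the explanation."""
    n_remove = min(budget, len(complement_edges))
    removed = set(rng.sample(range(len(complement_edges)), n_remove))
    return [e for i, e in enumerate(complement_edges) if i not in removed]


def top_k_edges(scores, ratio):
    """Return the indices of the ceil(ratio * n) highest-scoring edges.
    Ties are broken by the lower edge index so the result is deterministic
    (an assumed tie-breaking rule, not necessarily the authors')."""
    k = math.ceil(ratio * len(scores))
    order = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    return sorted(order[:k])


def faith(nec_norm, suf_norm):
    """Harmonic mean of the normalized Nec and Suf scores, both assumed
    to lie in [0, 1]. Returns 0.0 when both scores are zero."""
    if nec_norm + suf_norm == 0:
        return 0.0
    return 2 * nec_norm * suf_norm / (nec_norm + suf_norm)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed for repeatable perturbations
    complement = [(0, 1), (1, 2), (2, 3), (3, 4)]
    print(remove_random_edges(complement, edge_budget(40), rng))
    print(top_k_edges([0.9, 0.5, 0.5, 0.1], 0.5))
    print(faith(0.5, 0.5))
```

Breaking ties by index is one way to avoid the kind of nondeterminism described in the Hardware Specification row, where equal scores can yield different selections across runs.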