Benchmarking Graph Neural Networks
Authors: Vijay Prakash Dwivedi, Chaitanya K. Joshi, Anh Tuan Luu, Thomas Laurent, Yoshua Bengio, Xavier Bresson
JMLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | benchmarks must be developed to quantify progress. This led us in March 2020 to release a benchmark framework that i) comprises a diverse collection of mathematical and real-world graphs, ii) enables fair model comparison with the same parameter budget to identify key architectures, iii) has an open-source, easy-to-use and reproducible code infrastructure, and iv) is flexible for researchers to experiment with new theoretical ideas. As a proof of value of our benchmark, we study the case of graph positional encoding (PE) in GNNs, which was introduced with this benchmark and has since spurred interest in exploring more powerful PE for Transformers and GNNs in a robust experimental setting. Keywords: Graph Neural Networks, Benchmarking, Graph Datasets, Exploration Tool |
| Researcher Affiliation | Academia | 1Nanyang Technological University, Singapore, 2University of Cambridge, UK, 3Loyola Marymount University, USA, 4Mila, University of Montréal, Canada, 5National University of Singapore |
| Pseudocode | Yes | We proposed the use of Laplacian eigenvectors (Belkin and Niyogi, 2003) as node positional encoding by building on top of corresponding dataset files in the data module as shown in the pseudo-code snippet alongside. In other words, the positional encoding pi for a node i can be added to its features xi as xi = xi + pi. Figure 2: Primary code block in data module to implement Graph PE. |
| Open Source Code | Yes | release a benchmark framework that i) comprises a diverse collection of mathematical and real-world graphs, ii) enables fair model comparison with the same parameter budget to identify key architectures, iii) has an open-source, easy-to-use and reproducible code infrastructure, and iv) is flexible for researchers to experiment with new theoretical ideas. As of December 2022, the GitHub repository1 has reached 2,000 stars and 380 forks, which demonstrates the utility of the proposed open-source framework through the wide usage by the GNN community. 1. The framework is hosted at https://github.com/graphdeeplearning/benchmarking-gnns. |
| Open Datasets | Yes | a benchmark framework that i) comprises a diverse collection of mathematical and real-world graphs ... In this paper, we present an updated version of our benchmark with a concise presentation of the aforementioned framework characteristics, an additional medium-sized molecular dataset AQSOL, similar to the popular ZINC, but with a real-world measured chemical target, and discuss how this framework can be leveraged to explore new GNN designs and insights. |
| Dataset Splits | Yes | ZINC has 10,000 train, 1,000 validation and 1,000 test graphs. Splitting. We provide a scaffold splitting (Hu et al., 2020) of the dataset in the ratio 8:1:1 to have 7,831 train, 996 validation and 996 test graphs. Splitting. We use the realistic training, validation and test edge splits provided by OGB. Splitting. We follow the splitting defined in Mernyei and Cangea (2020) that has 20 different training, validation and early stopping splits consisting of 5% nodes, 22.5% nodes and 22.5% nodes of each class respectively. Splitting. We use the standard splits of MNIST and CIFAR10. MNIST has 55,000 train, 5,000 validation, 10,000 test graphs and CIFAR10 has 45,000 train, 5,000 validation, 10,000 test graphs. Splitting. The PATTERN dataset has 10,000 train, 2,000 validation, 2,000 test graphs and CLUSTER dataset has 10,000 train, 1,000 validation, 1,000 test graphs. Splitting. TSP has 10,000 train, 1,000 validation and 1,000 test graphs. Splitting. We perform a 5-fold cross validation split, following Murphy et al. (2019), which gives 5 sets of train, validation and test data indices in the ratio 3:1:1. Splitting. Therefore, the resulting CYCLES dataset has 9,000 train / 1,000 validation / 10,000 test graphs with all the sets having class-balanced samples. Splitting. We use the same splitting sets as in Corso et al. (2020) which has 5,120 train, 640 validation, 1,280 test graphs. Splitting. Since the 3 TU datasets that we use do not have standard splits, we perform a 10-fold cross validation split which gives 10 sets of train, validation and test data indices in the ratio 8:1:1. |
| Hardware Specification | Yes | All experiments were implemented in DGL/PyTorch. We run experiments for MNIST, CIFAR10, ZINC, AQSOL, TSP, COLLAB, WikiCS, CSL, CYCLES, GraphTheoryProp and TUs on an Intel Xeon CPU E5-2690 v4 server with 4 Nvidia 1080Ti GPUs (11 GB), and for PATTERN and CLUSTER on an Intel Xeon Gold 6132 CPU with 4 Nvidia 2080Ti (11 GB) GPUs. Each experiment was run on a single GPU and 4 experiments were run on the server at any given time (on different GPUs). |
| Software Dependencies | No | All experiments were implemented in DGL/PyTorch. Our benchmarking infrastructure builds upon PyTorch (Paszke et al., 2019) and DGL (Wang et al., 2019). |
| Experiment Setup | Yes | Training. We use the Adam optimizer (Kingma and Ba, 2014) with the same learning rate decay strategy for all models. An initial learning rate is selected in {10^-2, 10^-3, 10^-4}, which is reduced by half if the validation loss does not improve after a fixed number of epochs, in the range 5-25. We do not set a maximum number of epochs; the training is stopped either when the learning rate has reached the small value of 10^-6, or the computational time reaches 12 hours. We run each experiment with 4 different seeds and report the statistics of the 4 results. Parameter budgets. Our goal is not to find the optimal set of hyperparameters for a specific GNN model (which is computationally expensive), but to compare and benchmark the models and/or their building blocks within a budget of parameters. Therefore, we decide on using two parameter budgets: (1) 100k parameters for each GNN for all the tasks, and (2) 500k parameters for GNNs for which we investigate scaling a model to larger parameters and deeper layers. |
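The Pseudocode row describes the paper's Laplacian-eigenvector positional encoding, where the encoding p_i of node i is added to its features as x_i = x_i + p_i. A minimal NumPy sketch of that idea is below; the function name, the toy 4-node cycle graph, and the use of the symmetric-normalized Laplacian without isolated-node handling are illustrative assumptions, not the benchmark's actual DGL implementation.

```python
import numpy as np

def laplacian_pe(adj, k):
    """Return the k non-trivial eigenvectors of the symmetric-normalized
    Laplacian, sorted by ascending eigenvalue (assumes no isolated nodes)."""
    deg = adj.sum(axis=1)
    d_inv_sqrt = deg ** -0.5
    lap = np.eye(len(adj)) - d_inv_sqrt[:, None] * adj * d_inv_sqrt[None, :]
    eigvals, eigvecs = np.linalg.eigh(lap)   # eigh -> ascending eigenvalues
    return eigvecs[:, 1:k + 1]               # drop the trivial first eigenvector

# Toy graph: a 4-node cycle, with the PE added to node features as x_i = x_i + p_i.
adj = np.array([[0, 1, 0, 1],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [1, 0, 1, 0]], dtype=float)
x = np.zeros((4, 2))           # placeholder node features with 2 channels
x = x + laplacian_pe(adj, 2)   # add a 2-dimensional positional encoding
```

Note that eigenvector signs are arbitrary; the paper's framework randomly flips them during training, which this sketch omits.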
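The Dataset Splits row mentions a 10-fold cross-validation split of the TU datasets yielding train/validation/test indices in the ratio 8:1:1. One common way to realize such a split is to rotate which fold serves as test and which as validation; the sketch below shows that scheme, but the fold assignment and seeding are assumptions, not the benchmark's exact indices.

```python
import numpy as np

def cross_val_splits(n, folds=10, seed=0):
    """Sketch of an 8:1:1 k-fold split: each round uses one fold as test,
    the next fold as validation, and the remaining folds as train."""
    rng = np.random.default_rng(seed)
    parts = np.array_split(rng.permutation(n), folds)
    splits = []
    for k in range(folds):
        test = parts[k]
        val = parts[(k + 1) % folds]
        train = np.concatenate([p for i, p in enumerate(parts)
                                if i not in (k, (k + 1) % folds)])
        splits.append((train, val, test))
    return splits

splits = cross_val_splits(100, folds=10)  # each round: 80 train / 10 val / 10 test
```

Rotating the validation fold ensures every sample appears exactly once in test and once in validation across the 10 rounds.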
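The Experiment Setup row describes the shared learning-rate schedule: halve the rate when the validation loss plateaus for a fixed number of epochs, and stop once it reaches 10^-6. A dependency-free sketch of that decay rule follows (the function name and the flat toy loss curve are illustrative; the real framework uses PyTorch's ReduceLROnPlateau-style scheduling and also stops at a 12-hour wall-clock limit, which is omitted here).

```python
def decay_on_plateau(val_losses, lr=1e-3, patience=10, min_lr=1e-6):
    """Halve lr whenever the validation loss has not improved for
    `patience` epochs; stop once lr falls below min_lr.
    Returns the lr in effect at each completed epoch."""
    best, wait, history = float("inf"), 0, []
    for loss in val_losses:
        if lr < min_lr:
            break  # training stops here
        history.append(lr)
        if loss < best:
            best, wait = loss, 0
        else:
            wait += 1
            if wait >= patience:
                lr, wait = lr / 2, 0
    return history

# A flat loss curve never improves, so lr halves every `patience` epochs.
lrs = decay_on_plateau([1.0] * 25, lr=1e-3, patience=10)
```

With a 25-epoch flat curve the schedule runs at 1e-3 for the first stretch, drops to 5e-4 after ten stale epochs, then to 2.5e-4, mirroring the halving rule quoted above.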