Towards Efficient Training of Graph Neural Networks: A Multiscale Approach

Authors: Eshed Gal, Moshe Eliasof, Carola-Bibiane Schönlieb, Ivan Kyrchei, Eldad Haber, Eran Treister

TMLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental We also provide some theoretical analysis of our methods and demonstrate their effectiveness across various datasets and learning tasks. Our results show that multiscale training can substantially accelerate GNN training for large-scale problems while maintaining, or even improving, predictive performance.
Researcher Affiliation Academia Eshed Gal EMAIL Faculty of Computer and Information Science Ben-Gurion University of the Negev Moshe Eliasof EMAIL Department of Applied Mathematics and Theoretical Physics University of Cambridge Carola-Bibiane Schönlieb EMAIL Department of Applied Mathematics and Theoretical Physics University of Cambridge Ivan I. Kyrchei ivankyrchei26@gmail.com Pidstryhach Institute for Applied Problems of Mechanics and Mathematics NAS of Ukraine, Lviv, Ukraine Eldad Haber EMAIL Department of Earth, Ocean and Atmospheric Sciences University of British Columbia Eran Treister EMAIL Faculty of Computer and Information Science Ben-Gurion University of the Negev
Pseudocode Yes Algorithm 1 Multiscale algorithm Algorithm 2 Multiscale Gradients Computation Algorithm 3 Random Coarsening Algorithm 4 Topk Coarsening Algorithm 5 Subgraph Coarsening
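Among the listed procedures, Random Coarsening (Algorithm 3) reduces the graph by sampling a subset of nodes and taking the induced subgraph. A minimal sketch of that idea is below; the function name, signature, and edge-array layout are illustrative assumptions, not the authors' API.

```python
import numpy as np

def random_coarsen(edge_index: np.ndarray, num_nodes: int,
                   ratio: float = 0.5, seed: int = 0):
    """Sketch of random graph coarsening: keep `ratio` of the nodes at
    random and return the induced subgraph with relabeled node ids.

    edge_index: int array of shape (2, num_edges) with source/target ids.
    Returns (coarse_edge_index, kept_node_ids).
    """
    rng = np.random.default_rng(seed)
    keep = rng.choice(num_nodes, size=max(1, int(ratio * num_nodes)),
                      replace=False)
    keep_mask = np.zeros(num_nodes, dtype=bool)
    keep_mask[keep] = True
    # Keep only edges whose endpoints both survive the coarsening.
    edge_mask = keep_mask[edge_index[0]] & keep_mask[edge_index[1]]
    sub_edges = edge_index[:, edge_mask]
    # Relabel surviving nodes to a contiguous 0..k-1 range.
    new_id = -np.ones(num_nodes, dtype=int)
    kept_sorted = np.sort(keep)
    new_id[kept_sorted] = np.arange(kept_sorted.size)
    return new_id[sub_edges], kept_sorted
```

Top-k and subgraph coarsening (Algorithms 4 and 5) would replace the random node selection with a score-based ranking or an ego-network extraction, respectively, while the induced-subgraph step stays the same.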
Open Source Code Yes Our code is available here: https://github.com/eshedgal1/GraphMultiscale.
Open Datasets Yes We evaluate our approach using the OGBN-Arxiv dataset (Hu et al., 2020) with GCN (Kipf & Welling, 2016), GIN (Xu et al., 2018), and GAT (Velickovic et al., 2017). We demonstrate our results using the OGBN-MAG dataset (Hu et al., 2020), which is a heterogeneous network composed of a subset of the Microsoft Academic Graph (MAG). We further test our methods on the Flickr (Zeng et al., 2019), WikiCS (Mernyei & Cangea, 2020), DBLP (Bojchevski & Günnemann, 2017), and the transductive versions of the Facebook, BlogCatalog, and PPI (Yang et al., 2020) datasets. We evaluate our method on the ShapeNet dataset (Chang et al., 2015), a point cloud dataset with multiple categories, for a node segmentation task. Additionally, we construct a second synthetic dataset derived from the MNIST image dataset (LeCun et al., 1998). We further evaluate our Multiscale Gradients Computation method on the protein-protein interaction (PPI) dataset (Zitnik & Leskovec, 2017). Furthermore, Appendix I also presents results on the NCI1 (Wale et al., 2008) and ogbg-molhiv (Hu et al., 2020) datasets.
Dataset Splits Yes We evaluate our method on the Cora, Citeseer, and PubMed datasets (Sen et al., 2008) using the standard split from Yang et al. (2016), with 20 nodes per class for training, 500 validation nodes, and 1,000 testing nodes. The PPI dataset (Zitnik & Leskovec, 2017) is a biological graph dataset where nodes represent proteins... includes 20 graphs for training, 2 for validation, and 2 for testing, with an average of 2,372 nodes per graph.
Hardware Specification No Timing was measured on GPU and averaged over epochs after training stabilized ("an average of 100 epochs after training was stabilized"; "we averaged across 5 epochs after the training loss was stabilized"). No specific GPU model or other hardware details are given.
Software Dependencies No Network architecture: GCN (Kipf & Welling, 2016) with 4 layers and 192 hidden channels; GIN (Xu et al., 2018) with 3 layers and 256 hidden channels; GAT (Velickovic et al., 2017) with 3 layers and 64 hidden channels, using 2 heads. Loss function: Negative Log-Likelihood Loss, except for PPI (transductive) and Facebook (Yang et al., 2020), which use Cross-Entropy Loss. Optimizer: Adam (Kingma & Ba, 2014). Learning rate: 1×10⁻³. The paper mentions various architectures and optimizers but does not provide specific version numbers for software libraries such as PyTorch, TensorFlow, or Python itself.
Experiment Setup Yes We evaluate multiscale training with 2, 3, and 4 levels of coarsening, reducing the graph size by a factor of 2 at each level. Training epochs are doubled at each coarsening step, while fine-grid epochs remain fewer than in standard training. For multiscale gradient computation, we coarsen the graph to retain 75% of the nodes, perform half of the training using the coarsened graph, and then transition to fine-grid training. Table 18: Experimental setup for transductive learning datasets. Network architecture: GCN (Kipf & Welling, 2016) with 4 layers and 192 hidden channels; GIN (Xu et al., 2018) with 3 layers and 256 hidden channels; GAT (Velickovic et al., 2017) with 3 layers and 64 hidden channels, using 2 heads. Loss function: Negative Log-Likelihood Loss, except for PPI (transductive) and Facebook (Yang et al., 2020), which use Cross-Entropy Loss. Optimizer: Adam (Kingma & Ba, 2014). Learning rate: 1×10⁻³. Baseline training: 2000 epochs. Multiscale gradients: 2, 3, and 4 levels. Coarse-to-fine strategy: each coarsening reduces the number of nodes by half compared to the previous level. Sub-to-Full strategy: using ego-networks (Gupta et al., 2014); 6 hops on level 2, 4 on level 3, and 2 on level 4. Multiscale training strategy: using [1000, 2000] epochs for 2 levels, [800, 1600, 3200] epochs for 3 levels, and [600, 1200, 2400, 4800] epochs for 4 levels (the first number is the fine-grid epoch number).
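The epoch schedule described above follows a simple pattern: each coarser level trains for twice the epochs of the next finer level, and each coarsening halves the node count. A small sketch reproducing that schedule (the function name is illustrative, not from the paper's code):

```python
def multiscale_schedule(levels: int, fine_epochs: int):
    """Build the coarse-to-fine schedule described in the setup:
    epochs double at each coarsening step, and each level keeps
    half the nodes of the previous one.

    Returns (epochs_per_level, node_fraction_per_level),
    ordered finest level first, matching the paper's listing
    where the first number is the fine-grid epoch count.
    """
    epochs = [fine_epochs * 2 ** level for level in range(levels)]
    node_fraction = [0.5 ** level for level in range(levels)]
    return epochs, node_fraction
```

For example, `multiscale_schedule(3, 800)` yields the `[800, 1600, 3200]` schedule quoted above, with the coarsest level holding a quarter of the original nodes.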