Cooperative Minibatching in Graph Neural Networks
Authors: Muhammed Fatih Balın, Dominique LaSalle, Ümit V. Çatalyürek
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experimental evaluations show up to 4x bandwidth savings for fetching vertex embeddings, by simply increasing this dependency without harming model convergence. Combining our proposed approaches, we achieve up to 64% speedup over Independent Minibatching on single-node multi-GPU systems, using the same resources. ... 4 Experimental Evaluation: We first compare how the work to process an epoch changes w.r.t. the batch size to empirically validate Theorems 3.1 and 3.2 for different graph sampling algorithms. Next, we show how the dependent batches introduced in Section 3.2 benefit GNN training. We also show the runtime benefits of cooperative minibatching compared to independent minibatching in the multi-GPU setting. Finally, we show that these two techniques are orthogonal and can be combined to get multiplicative savings. |
| Researcher Affiliation | Collaboration | Muhammed Fatih Balın, School of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, GA, USA; Dominique LaSalle, NVIDIA Corporation, Santa Clara, CA, USA; Ümit V. Çatalyürek, School of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, GA, USA |
| Pseudocode | Yes | Algorithm 1: Cooperative minibatching. Input: seed vertices S^0_p for each PE p ∈ P, # layers L. for all l ∈ {0, …, L−1} do: {Sampling} for all p ∈ P do in parallel: sample next-layer vertices S^{l+1}_p and edges E^l_p for S^l_p; all-to-all to redistribute vertex ids for S^{l+1}_p to get S^{l+1}_p; for all p ∈ P do in parallel: {Feature Loading} … |
| Open Source Code | Yes | Source code is available at https://github.com/GT-TDAlab/dgl-coop/tree/dist_graph_squashed_wip_cache |
| Open Datasets | Yes | In our experiments, we use the following datasets: reddit (Hamilton et al., 2017), papers100M (Hu et al., 2020a), mag240M (Hu et al., 2021), yelp and flickr (Zeng et al., 2020), and their details are given in Table 2. |
| Dataset Splits | Yes | Table 2: Traits of datasets used in experiments: numbers of vertices, edges, avg. degree, features, cached vertex embeddings, and training/validation/test vertex splits; the last column gives the # minibatches per epoch during model training with batch size 1024, including validation. … flickr: 89.2K vertices, 900K edges, avg. degree 10.09, 500 features, 70k cached, 50.00/25.00/25.00 train/val/test split (%), 65 minibatches; yelp: … 75.00/10.00/15.00 …; reddit: … 66.00/10.00/24.00 …; papers100M: … 1.09/0.11/0.19 …; mag240M: … 0.45/0.06/0.04 |
| Hardware Specification | Yes | We present our runtime results on systems equipped with NVIDIA GPUs, with 4 and 8 A100 80 GB (NVIDIA, 2021) and 16 V100 32GB (NVIDIA, 2020b), all with NVLink interconnect between the GPUs (600 GB/s for A100 and 300 GB/s for V100). |
| Software Dependencies | No | We implemented our experimental code using C++ and Python in the DGL framework (Wang et al., 2019) with the Pytorch backend (Paszke et al., 2019). No specific version numbers for C++, Python, DGL, or Pytorch are provided. |
| Experiment Setup | Yes | All our experiments involve a GCN model with L = 3 layers (Hamilton et al., 2017), with 1024 hidden dimension for mag240M and papers100M and 256 for the rest. Additionally, the papers100M and mag240M datasets were made undirected for all experiments, and this is reflected in the reported edge counts in Table 2. Input features of mag240M are stored with the 16-bit floating point type. We use the Adam optimizer (Kingma & Ba, 2014) with a 10^-3 learning rate in all the experiments. ... We used a fanout of k = 10 for the samplers. In addition, Random Walks used a length of o = 3, restart probability p = 0.5, and a = 100 random walks from each seed. |
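
The Algorithm 1 excerpt quoted in the Pseudocode row can be sketched as a single-process simulation of the cooperative sampling loop. Everything here is an illustrative assumption, not the paper's DGL implementation: `sample_neighbors`, `owner`, and the modulo partitioning of vertex ids are hypothetical stand-ins for the per-PE sampling, ownership, and all-to-all redistribution steps.

```python
# Minimal sketch of Algorithm 1 (cooperative minibatching), simulating
# P processing elements in one process. Names and the modulo-based vertex
# ownership are assumptions for illustration only.
import random

def sample_neighbors(graph, seeds, fanout, rng):
    """Sample up to `fanout` neighbors per seed; return (edges, next-layer vertices)."""
    edges, nxt = [], set()
    for v in seeds:
        nbrs = graph.get(v, [])
        for u in rng.sample(nbrs, min(fanout, len(nbrs))):
            edges.append((u, v))
            nxt.add(u)
    return edges, nxt

def owner(v, num_pes):
    # Assumed vertex partitioning: PE p owns vertices with id % num_pes == p.
    return v % num_pes

def cooperative_minibatch(graph, seeds_per_pe, num_layers, fanout, num_pes, seed=0):
    rng = random.Random(seed)
    frontiers = [set(s) for s in seeds_per_pe]      # S^0_p for each PE p
    all_edges = [[] for _ in range(num_pes)]
    for _ in range(num_layers):
        # {Sampling}: each PE samples the next layer for its local frontier.
        sampled = [sample_neighbors(graph, frontiers[p], fanout, rng)
                   for p in range(num_pes)]
        # All-to-all: route each sampled vertex id to its owner PE, so a
        # vertex duplicated across PEs is fetched only once, by its owner.
        new_frontiers = [set() for _ in range(num_pes)]
        for p in range(num_pes):
            edges, nxt = sampled[p]
            all_edges[p].extend(edges)
            for u in nxt:
                new_frontiers[owner(u, num_pes)].add(u)
        frontiers = new_frontiers
    return frontiers, all_edges
```

After the all-to-all step, each PE holds a disjoint slice of the next frontier, which is what allows the paper's bandwidth savings: shared neighbors are loaded once per owner rather than once per PE.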
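
The reported model configuration (3-layer GCN, hidden dimension 256 or 1024, Adam at 10^-3) can be illustrated with a minimal dense-GCN forward pass in NumPy. Only the layer count, hidden sizes, and learning rate come from the report; the mean-aggregation rule, the weight initialization, and all function names are assumptions for illustration, not the authors' DGL/PyTorch code.

```python
# Illustrative sketch of the reported setup: a 3-layer GCN. Dense adjacency
# and mean aggregation are simplifying assumptions; the paper's actual model
# runs on sampled subgraphs in DGL with a PyTorch backend.
import numpy as np

def gcn_layer(adj, x, w):
    """Mean-aggregate neighbor features, then linear transform + ReLU."""
    deg = np.maximum(adj.sum(axis=1, keepdims=True), 1)
    return np.maximum((adj @ x / deg) @ w, 0.0)

def gcn_forward(adj, x, weights):
    for w in weights[:-1]:
        x = gcn_layer(adj, x, w)
    # Final layer: no ReLU, produces class logits.
    deg = np.maximum(adj.sum(axis=1, keepdims=True), 1)
    return (adj @ x / deg) @ weights[-1]

def init_weights(in_dim, hidden, num_classes, num_layers=3, seed=0):
    # L = 3 layers as reported; hidden = 256 (or 1024 for the large graphs).
    rng = np.random.default_rng(seed)
    dims = [in_dim] + [hidden] * (num_layers - 1) + [num_classes]
    return [rng.standard_normal((a, b)) * 0.1 for a, b in zip(dims, dims[1:])]
```

In the reported experiments this model would be trained with Adam at a 10^-3 learning rate on minibatches sampled with fanout k = 10; those pieces are omitted here to keep the sketch self-contained.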