Graph Knowledge Distillation to Mixture of Experts
Authors: Pavel Rumiantsev, Mark Coates
TMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct a series of experiments showing that our approach can be efficiently and effectively applied to datasets of various sizes. To evaluate our model, we explore both transductive and inductive settings for nine publicly available real-world datasets. We show that our model can utilize additional parameters more efficiently than a parameter-inflated MLP, an ensemble of MLPs, or a vanilla mixture-of-experts model. We conduct an ablation study to show how the various loss terms influence accuracy. |
| Researcher Affiliation | Academia | Pavel Rumiantsev EMAIL The Department of Electrical and Computer Engineering, McGill University; Mark Coates EMAIL The Department of Electrical and Computer Engineering, McGill University |
| Pseudocode | No | The paper describes the methodology using mathematical formulations and descriptive text, but it does not include a distinct section labeled "Pseudocode" or "Algorithm", nor does it present any formatted code blocks. |
| Open Source Code | Yes | Code available at https://github.com/Rufaim/routing-by-memory. |
| Open Datasets | Yes | To conduct our experiments we use nine real-world datasets: Cora (Sen et al., 2008), Citeseer (Giles et al., 1998), Pubmed (McCallum et al., 2000), Amazon-Photo, Amazon-Computers, Academic-CS, Academic-Physics (Shchur et al., 2018), OGB-Arxiv and OGB-Products (Hu et al., 2020). |
| Dataset Splits | Yes | For the Cora, Citeseer, and Pubmed datasets, we follow the data splitting strategy specified by Kipf & Welling (2016). For Amazon-Photo, Amazon-Computers, Academic-CS, and Academic-Physics, we follow the procedure employed by Zhang et al. (2021b), Tian et al. (2022) and Wu et al. (2023): we randomly split the data into train/val/test subsets, and each random seed corresponds to a different data split. For OGB-Arxiv and OGB-Products we use the public data splits provided by Hu et al. (2020). For the inductive setting, we split the unlabeled nodes, V_U, into a set of observed nodes, V_U^obs, and a set of inductive nodes, V_U^ind, by randomly selecting 20% of the nodes as the inductive subset, following the procedure of Tian et al. (2022) and Zhang et al. (2021b). |
| Hardware Specification | Yes | Our experiments were conducted using an NVIDIA Tesla V100 GPU with 32GB of memory. The machine has an Intel Xeon Gold 6140 CPU with clock frequency of 2.30GHz and total thread count of 36. |
| Software Dependencies | No | We use Ray Tune (Liaw et al., 2018) to tune model hyperparameters. Specifically, we use the Optuna search algorithm (Akiba et al., 2019). We use the Adam optimizer (Kingma & Ba, 2014). |
| Experiment Setup | Yes | We tuned the following model structure hyperparameters: (i) dropout rate was selected from [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6] and applied to all dropout layers in the model; (ii) total number of experts was selected from [4, 5, 6, 7, 8]. In addition to the structure hyperparameters, we selected the following training hyperparameters: (i) learning rate for the Adam optimizer (Kingma & Ba, 2014) was chosen from [0.01, 0.005, 0.001]; (ii) weight α of the commitment loss (6) from the range [0.0, 0.1]; (iii) weights β and γ of the load-balancing loss (8) and self-similarity loss (7), respectively, from the range [0.0, 0.05]. In our experiments, we set λ0 = 0.9, T = 200 and = 0.05. |
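The inductive evaluation protocol quoted in the Dataset Splits row (randomly holding out 20% of the unlabeled nodes as the inductive subset) can be sketched as follows. This is an illustrative reconstruction, not the authors' released code; the function name, the use of NumPy, and the seed handling are assumptions.

```python
import numpy as np

def inductive_split(unlabeled_nodes, ind_fraction=0.2, seed=0):
    """Split unlabeled node indices into observed and inductive subsets.

    Following the protocol of Tian et al. (2022) and Zhang et al. (2021b),
    a random fraction (20% by default) of the unlabeled nodes V_U is held
    out as the inductive set V_U^ind; the rest form the observed set
    V_U^obs. A different seed yields a different split.
    """
    rng = np.random.default_rng(seed)
    nodes = np.asarray(unlabeled_nodes)
    perm = rng.permutation(len(nodes))          # shuffle node positions
    n_ind = int(round(ind_fraction * len(nodes)))
    ind_idx, obs_idx = perm[:n_ind], perm[n_ind:]
    return nodes[obs_idx], nodes[ind_idx]       # (V_U^obs, V_U^ind)
```

In a transductive run the observed unlabeled nodes are visible to the teacher GNN during training, while the inductive nodes are only used at evaluation time, which is what makes per-seed random splits necessary for reporting variance across runs.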