Probabilistic Learning on Graphs via Contextual Architectures
Authors: Davide Bacciu, Federico Errica, Alessio Micheli
JMLR 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To evaluate its effectiveness, we target graph and node classification, and we obtain competitive performances when compared to expensive kernels and neural models, thus empirically showing the richness of the information captured by the model. Notably, computational complexity is linear in the number of edges, and we can achieve fine-grained scalability by distributing each vertex's job across different computational resources. |
| Researcher Affiliation | Academia | Davide Bacciu EMAIL Federico Errica EMAIL Alessio Micheli EMAIL Department of Computer Science University of Pisa 56127, PI, Italy |
| Pseudocode | Yes | To summarize this Section, we provide a pseudo-code of the incremental training procedure in Algorithm 1, showing how the ideas described so far are put together to construct vertex (line 14) and graph (line 15) representations. |
| Open Source Code | Yes | 1. The implementation is available here: https://github.com/diningphil/CGMM. |
| Open Datasets | Yes | We extend the comparison in Bacciu et al. (2018a) with six common graph classification benchmarks (Kersting et al., 2020) and one vertex classification task. As regards graph classification, three data sets come from the chemical domain, while the others are social network data sets. D&D (Dobson and Doig, 2003) is a data set of proteins in which we have to distinguish between enzymes and non-enzymes; this data set is computationally challenging for kernels because of the high connectivity and size of each graph. Similarly, in PROTEINS (Borgwardt et al., 2005), vertices are connected if they are neighbours in the amino-acid sequence or 3D space. The last chemical data set, NCI1 (Wale et al., 2008), is the biggest in terms of the number of graphs. It contains two balanced subsets of chemical compounds that need to be classified according to their ability to suppress the growth of a panel of human tumour cell lines. Instead, IMDB-BINARY and IMDB-MULTI (Yanardag and Vishwanathan, 2015) are movie-collaboration data sets where vertices represent actors/actresses, and there is an edge every time two of them appeared in the same movie; the task is to classify the associated movie genre. Finally, COLLAB (Leskovec et al., 2005) is a scientific-collaboration benchmark derived from 3 other data sets, and the task is to classify the specific research field each community belongs to. To test CGMM performance on vertex classification tasks, we use the protein-protein interaction data set (PPI) introduced in Hamilton et al. (2017). |
| Dataset Splits | Yes | To assess CGMM's classification performance, we used a Double Cross-Validation with 10 external folds and 5 internal folds. The internal folds are needed for grid-search model selection, which returns a biased estimate (optimization is done on validation folds) of the best hyper-parameters for each external split. Then, these configurations are retrained on the corresponding external training fold (using a validation set to do early stopping) and evaluated on the external test fold. We repeat each final retraining three times and average results in order to mitigate possible bad random initializations. The output of this procedure is an estimate of CGMM's performance. Importantly, the test fold was never seen throughout the entire process; unfortunately, we have observed this is not always the case in DNGNs literature, where model selection's best results, i.e., those of a k-fold cross-validation, are reported. This trend is also criticized in Klicpera et al. (2019). Consequently, we compare with results from Tran et al. (2018), where the authors performed an external 10-fold cross-validation with a simple hold-out internal model selection procedure. Their process is repeated ten times with different splits, and results are averaged. This approach, which is similar to bootstrapping, is less subject to unlucky splits, with a consequent variance reduction; on the other hand, it might be more biased than a k-fold cross-validation (Kohavi, 1995), which is why we use the latter. Despite these differences, both evaluations are robust and comparable estimates of a model's performance. A train/validation/test split was already provided for PPI, so we used a standard hold-out technique for this task. |
| Hardware Specification | No | Also, data parallelism could be trivially achieved by distributing the epoch's mini-batches on different computing vertices, such as CPUs or a cluster of machines. For these reasons, CGMM is a suitable candidate to handle large-scale graph learning. Up to now, our implementation does not exploit intra-epoch parallelization: each training epoch was just fast enough for our purposes. |
| Software Dependencies | No | The paper mentions training with Adam (Kingma and Ba, 2015) optimizer and Cross-Entropy loss (Mean Squared Error for PPI) and refers to PyTorch Geometric as a reference, but does not specify versions of any software used for their own implementation. |
| Experiment Setup | Yes | We now list the hyper-parameters needed by CGMM and by the supervised classifier working on graph and vertex representations. We achieve constant per-layer complexity by setting \|L\| = 1, but we also evaluate the impact of all previous layers at the cost of additional computation. The number of EM epochs is fixed to 10: the reason is that the likelihood always stabilizes around that value in our experiments. Moreover, we tried to use both continuous and discrete vertex representations, as described in Section 3.3.4. On the contrary, the aggregation function (sum or mean) was usually fixed after an initial experimental screening, which helped to reduce the grid dimension. Overall, the different configurations of CGMM's architecture depended on two hyper-parameters only: the number of hidden states C and the number of layers. Furthermore, we can exploit the incremental nature of our model to reduce the dimension of the grid search space: if the set of depth hyper-parameters has maximum value N, we train a single network of N layers and cut it to obtain different architectures of depth M < N. Contrarily to our previous work (Bacciu et al., 2018a), we did not use the pooling strategy of Fahlman and Lebiere (1990): in short, it consists of training a pool of candidate Bayesian Networks at each layer and selecting the best one according to some metric, e.g., validation performances. This choice is motivated by empirical studies that will be presented in Section 5.4. The training time improvement factor is equal to the number of candidates that were used in Bacciu et al. (2018a), i.e., 10. Another point to consider is whether the learned representations are sufficient to linearly separate classes; to this end, we introduced a logistic regressor in the grid search. However, we add a flexible alternative in the form of a one-hidden-layer MLP with ReLU activations. Both of them are trained with the Adam (Kingma and Ba, 2015) optimizer and a Cross-Entropy loss (Mean Squared Error for PPI); to contrast overfitting, we introduced L2 regularization. Also, we used the early stopping technique called Generalization Loss (Prechelt, 1998) with α = 5, which is considered to be a good compromise between training time and performance. In Table 5, we report the number of epochs after which early stopping starts; at the beginning of training, the validation loss oscillates smoothly and accuracy does not steadily increase, so it would be unwise to apply this technique immediately. For space reasons, we list the hyper-parameters table in Appendix B. |
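The double cross-validation protocol quoted under Dataset Splits (10 external folds for risk assessment, 5 internal folds for grid-search model selection, then retraining the selected configuration on each external training fold) can be sketched as follows. This is a minimal illustration using scikit-learn, with a logistic regressor standing in for the classifier trained on CGMM representations; the paper's three averaged retrainings and early stopping are omitted for brevity.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, ParameterGrid
from sklearn.linear_model import LogisticRegression

def nested_cv(X, y, param_grid, outer_folds=10, inner_folds=5, seed=0):
    """Double cross-validation: the inner loop selects hyper-parameters,
    the outer loop estimates performance on never-seen test folds."""
    outer = StratifiedKFold(outer_folds, shuffle=True, random_state=seed)
    test_scores = []
    for train_idx, test_idx in outer.split(X, y):
        X_tr, y_tr = X[train_idx], y[train_idx]
        # Inner model selection, performed on the external training fold only.
        inner = StratifiedKFold(inner_folds, shuffle=True, random_state=seed)
        best_params, best_score = None, -np.inf
        for params in ParameterGrid(param_grid):
            fold_scores = []
            for tr, va in inner.split(X_tr, y_tr):
                clf = LogisticRegression(**params, max_iter=1000)
                clf.fit(X_tr[tr], y_tr[tr])
                fold_scores.append(clf.score(X_tr[va], y_tr[va]))
            if np.mean(fold_scores) > best_score:
                best_params, best_score = params, np.mean(fold_scores)
        # Retrain the winning configuration on the whole external training
        # fold and evaluate once on the held-out external test fold.
        clf = LogisticRegression(**best_params, max_iter=1000)
        clf.fit(X_tr, y_tr)
        test_scores.append(clf.score(X[test_idx], y[test_idx]))
    return float(np.mean(test_scores)), float(np.std(test_scores))
```

The key property the quote stresses holds here by construction: each external test fold influences neither model selection nor training, so the returned mean is an unbiased risk estimate rather than a model-selection score.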
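The grid-search shortcut described under Experiment Setup (train a single N-layer network once, then cut it to obtain every architecture of depth M < N) can be illustrated with a small sketch. Here `layer_outputs` and `score_fn` are hypothetical placeholders, not names from the paper's code: a list of per-layer vertex-representation matrices produced by one incremental training run, and any downstream evaluation routine.

```python
import numpy as np

def evaluate_depths(layer_outputs, depths, score_fn):
    """Incremental depth selection: because CGMM is trained layer by layer,
    a depth-M architecture is obtained by truncating the stack of per-layer
    representations, with no retraining per grid point."""
    results = {}
    for m in depths:
        # Concatenate the first m layers' vertex representations.
        repr_m = np.concatenate(layer_outputs[:m], axis=1)
        results[m] = score_fn(repr_m)
    return results
```

This is why depth costs almost nothing in the grid: one training run of the deepest network serves every shallower configuration.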
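The early-stopping rule quoted above, Prechelt's Generalization Loss with α = 5, can be sketched as a small stopping criterion. This is an assumed minimal implementation, not the authors' code; the `start_epoch` delay mirrors the paper's choice of applying the criterion only after an initial number of epochs (Table 5).

```python
class GeneralizationLoss:
    """Early stopping via Prechelt's Generalization Loss:
    GL(t) = 100 * (E_va(t) / E_opt(t) - 1),
    where E_opt(t) is the lowest validation loss observed up to epoch t.
    Training stops once GL(t) exceeds the threshold alpha."""

    def __init__(self, alpha=5.0, start_epoch=0):
        self.alpha = alpha
        self.start_epoch = start_epoch  # epochs before the criterion applies
        self.best = float("inf")

    def should_stop(self, epoch, val_loss):
        self.best = min(self.best, val_loss)
        if epoch < self.start_epoch:
            return False  # early oscillations would trigger spurious stops
        gl = 100.0 * (val_loss / self.best - 1.0)
        return gl > self.alpha
```

With α = 5, training continues as long as the current validation loss is within 5% of the best loss seen so far, trading a little extra training time for robustness to noisy validation curves.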