Probabilistic Learning on Graphs via Contextual Architectures
Authors: Davide Bacciu, Federico Errica, Alessio Micheli
JMLR 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To evaluate its effectiveness, we target graph and node classification, and we obtain competitive performances when compared to expensive kernels and neural models, thus empirically showing the richness of the information captured by the model. Notably, computational complexity is linear in the number of edges, and we can achieve fine-grained scalability by distributing each vertex's job across different computational resources. |
| Researcher Affiliation | Academia | Davide Bacciu EMAIL Federico Errica EMAIL Alessio Micheli EMAIL Department of Computer Science University of Pisa 56127, PI, Italy |
| Pseudocode | Yes | To summarize this Section, we provide a pseudo-code of the incremental training procedure in Algorithm 1, showing how the ideas described so far are put together to construct vertex (line 14) and graph (line 15) representations. |
| Open Source Code | Yes | 1. The implementation is available here: https://github.com/diningphil/CGMM. |
| Open Datasets | Yes | We extend the comparison in Bacciu et al. (2018a) with six common graph classification benchmarks (Kersting et al., 2020) and one vertex classification task. As regards graph classification, three data sets come from the chemical domain, while the others are social network data sets. D&D (Dobson and Doig, 2003) is a data set of proteins in which we have to distinguish between enzymes and non-enzymes; this data set is computationally challenging for kernels because of the high connectivity and size of each graph. Similarly, in PROTEINS (Borgwardt et al., 2005), vertices are connected if they are neighbours in the amino-acid sequence or 3D space. The last chemical data set, NCI1 (Wale et al., 2008), is the biggest in terms of the number of graphs. It contains two balanced subsets of chemical compounds that need to be classified according to their ability to suppress the growth of a panel of human tumour cell lines. Instead, IMDB-BINARY and IMDB-MULTI (Yanardag and Vishwanathan, 2015) are movie-collaboration data sets where vertices represent actors/actresses, and there is an edge every time two of them appeared in the same movie; the task is to classify the associated movie genre. Finally, COLLAB (Leskovec et al., 2005) is a scientific-collaboration benchmark derived from 3 other data sets, and the task is to classify the specific research field each community belongs to. To test CGMM performance on vertex classification tasks, we use the protein-protein interaction data set (PPI) introduced in Hamilton et al. (2017). |
| Dataset Splits | Yes | To assess CGMM's classification performance, we used a Double Cross-Validation with 10 external folds and 5 internal folds. The internal folds are needed for grid-search model selection, which returns a biased estimate (optimization is done on validation folds) of the best hyper-parameters for each external split. Then, these configurations are retrained on the corresponding external training fold (using a validation set to do early stopping) and evaluated on the external test fold. We repeat each final retraining three times and average results in order to mitigate possible bad random initializations. The output of this procedure is an estimate of CGMM's performance. Importantly, the test fold was never seen throughout the entire process; unfortunately, we have observed this is not always the case in DNGNs literature, where model selection's best results, i.e., those of a k-fold cross-validation, are reported. This trend is also criticized in Klicpera et al. (2019). Consequently, we compare with results from Tran et al. (2018), where the authors performed an external 10-fold cross-validation with a simple hold-out internal model selection procedure. Their process is repeated ten times with different splits, and results are averaged. This approach, which is similar to bootstrapping, is less subject to unlucky splits, with a consequent variance reduction; on the other hand, it might be more biased than a k-fold cross-validation (Kohavi, 1995), which is why we use the latter. Despite these differences, both evaluations are robust and comparable estimates of a model's performance. A train/validation/test split was already provided for PPI, so we used a standard hold-out technique for this task. |
| Hardware Specification | No | Also, data parallelism could be trivially achieved by distributing the epoch's mini-batches on different computing vertices, such as CPUs or a cluster of machines. For these reasons, CGMM is a suitable candidate to handle large-scale graph learning. Up to now, our implementation does not exploit intra-epoch parallelization: each training epoch was just fast enough for our purposes. |
| Software Dependencies | No | The paper mentions training with Adam (Kingma and Ba, 2015) optimizer and Cross-Entropy loss (Mean Squared Error for PPI) and refers to PyTorch Geometric as a reference, but does not specify versions of any software used for their own implementation. |
| Experiment Setup | Yes | We now list the hyper-parameters needed by CGMM and by the supervised classifier working on graph and vertex representations. We achieve constant per-layer complexity by setting \|L\| = 1, but we also evaluate the impact of all previous layers at the cost of additional computation. The number of EM epochs is fixed to 10: the reason is that the likelihood always stabilizes around that value in our experiments. Moreover, we tried to use both continuous and discrete vertex representations, as described in Section 3.3.4. On the contrary, the aggregation function (sum or mean) was usually fixed after an initial experimental screening, which helped to reduce the grid dimension. Overall, the different configurations of CGMM's architecture depended on two hyper-parameters only: the number of hidden states C and the number of layers. Furthermore, we can exploit the incremental nature of our model to reduce the dimension of the grid search space: if the set of depth hyper-parameters has maximum value N, we train a single network of N layers and cut it to obtain different architectures of depth M < N. Contrarily to our previous work (Bacciu et al., 2018a), we did not use the pooling strategy of Fahlman and Lebiere (1990): in short, it consists of training a pool of candidate Bayesian Networks at each layer and selecting the best one according to some metric, e.g., validation performances. This choice is motivated by empirical studies that will be presented in Section 5.4. The training time improvement factor is equal to the number of candidates that were used in Bacciu et al. (2018a), i.e., 10. Another point to consider is whether the learned representations are sufficient to linearly separate classes; to this end, we introduced a logistic regressor in the grid search. However, we add a flexible alternative in the form of a one-hidden-layer MLP with ReLU activations. Both of them are trained with the Adam (Kingma and Ba, 2015) optimizer and a Cross-Entropy loss (Mean Squared Error for PPI); to contrast overfitting, we introduced L2 regularization. Also, we used the early stopping technique called Generalization Loss (Prechelt, 1998) with α = 5, which is considered to be a good compromise between training time and performance. In Table 5, we report the number of epochs after which early stopping starts; at the beginning of training, the validation loss oscillates smoothly and accuracy does not steadily increase, so it would be unwise to apply this technique immediately. For space reasons, we list the hyper-parameters table in Appendix B. |
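The double cross-validation protocol quoted under Dataset Splits (10 external folds for risk assessment, 5 internal folds for grid-search model selection, then retraining the selected configuration on each external training fold) can be sketched as follows. This is a minimal illustration using scikit-learn, with a logistic regressor standing in for the classifier trained on CGMM representations; the paper's three averaged retrainings and early stopping are omitted for brevity.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, ParameterGrid
from sklearn.linear_model import LogisticRegression

def nested_cv(X, y, param_grid, outer_folds=10, inner_folds=5, seed=0):
    """Double cross-validation: the inner loop selects hyper-parameters,
    the outer loop estimates performance on never-seen test folds."""
    outer = StratifiedKFold(outer_folds, shuffle=True, random_state=seed)
    test_scores = []
    for train_idx, test_idx in outer.split(X, y):
        X_tr, y_tr = X[train_idx], y[train_idx]
        # Inner model selection, performed on the external training fold only.
        inner = StratifiedKFold(inner_folds, shuffle=True, random_state=seed)
        best_params, best_score = None, -np.inf
        for params in ParameterGrid(param_grid):
            fold_scores = []
            for tr, va in inner.split(X_tr, y_tr):
                clf = LogisticRegression(**params, max_iter=1000)
                clf.fit(X_tr[tr], y_tr[tr])
                fold_scores.append(clf.score(X_tr[va], y_tr[va]))
            if np.mean(fold_scores) > best_score:
                best_params, best_score = params, np.mean(fold_scores)
        # Retrain the winning configuration on the whole external training
        # fold and evaluate once on the held-out external test fold.
        clf = LogisticRegression(**best_params, max_iter=1000)
        clf.fit(X_tr, y_tr)
        test_scores.append(clf.score(X[test_idx], y[test_idx]))
    return float(np.mean(test_scores)), float(np.std(test_scores))
```

The key property the quote stresses holds here by construction: each external test fold influences neither model selection nor training, so the returned mean is an unbiased risk estimate rather than a model-selection score.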
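The grid-search shortcut described under Experiment Setup (train a single N-layer network once, then cut it to obtain every architecture of depth M < N) can be illustrated with a small sketch. Here `layer_outputs` and `score_fn` are hypothetical placeholders, not names from the paper's code: a list of per-layer vertex-representation matrices produced by one incremental training run, and any downstream evaluation routine.

```python
import numpy as np

def evaluate_depths(layer_outputs, depths, score_fn):
    """Incremental depth selection: because CGMM is trained layer by layer,
    a depth-M architecture is obtained by truncating the stack of per-layer
    representations, with no retraining per grid point."""
    results = {}
    for m in depths:
        # Concatenate the first m layers' vertex representations.
        repr_m = np.concatenate(layer_outputs[:m], axis=1)
        results[m] = score_fn(repr_m)
    return results
```

This is why depth costs almost nothing in the grid: one training run of the deepest network serves every shallower configuration.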
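The early-stopping rule quoted above, Prechelt's Generalization Loss with α = 5, can be sketched as a small stopping criterion. This is an assumed minimal implementation, not the authors' code; the `start_epoch` delay mirrors the paper's choice of applying the criterion only after an initial number of epochs (Table 5).

```python
class GeneralizationLoss:
    """Early stopping via Prechelt's Generalization Loss:
    GL(t) = 100 * (E_va(t) / E_opt(t) - 1),
    where E_opt(t) is the lowest validation loss observed up to epoch t.
    Training stops once GL(t) exceeds the threshold alpha."""

    def __init__(self, alpha=5.0, start_epoch=0):
        self.alpha = alpha
        self.start_epoch = start_epoch  # epochs before the criterion applies
        self.best = float("inf")

    def should_stop(self, epoch, val_loss):
        self.best = min(self.best, val_loss)
        if epoch < self.start_epoch:
            return False  # early oscillations would trigger spurious stops
        gl = 100.0 * (val_loss / self.best - 1.0)
        return gl > self.alpha
```

With α = 5, training continues as long as the current validation loss is within 5% of the best loss seen so far, trading a little extra training time for robustness to noisy validation curves.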