Scalable Learning of Bayesian Network Classifiers

Authors: Ana M. Martínez, Geoffrey I. Webb, Shenglei Chen, Nayyar A. Zaidi

JMLR 2016

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our extensive experimental evaluation on 16 large data sets reveals that this out-of-core algorithm achieves competitive classification performance, and substantially better training and classification time than state-of-the-art in-core learners such as random forest and linear and non-linear logistic regression.
Researcher Affiliation | Academia | Ana M. Martínez EMAIL, Geoffrey I. Webb EMAIL, Faculty of Information Technology, Monash University, VIC 3800, Australia ... Shenglei Chen EMAIL, College of Information Science/Faculty of Information Technology, Nanjing Audit University/Monash University, China/Australia ... Nayyar A. Zaidi EMAIL, Faculty of Information Technology, Monash University, VIC 3800, Australia
Pseudocode | Yes | Algorithm 1: The KDB algorithm ... Algorithm 2: learnStructure(T) ... Algorithm 3: learnParameters(T, G) ... Algorithm 4: The SKDB algorithm
Open Source Code | Yes | A minimum functional part of the software containing SKDB can be found on the (first two) authors' academic web pages, for example http://www.csse.monash.edu.au/~webb/.
Open Datasets | Yes | We undertook an extensive online search to gather a group of large datasets, all of which have more than 100K instances. These datasets are described in Table 7 in Appendix A ... 2 census-income 0.299 41 2 Y (50%) (4.9 14.6%) UCI (Bache and Lichman, 2013; Oza and Russell, 2001). Weighted census data extracted from the 1994 and 1995 Current Population Surveys conducted by the U.S. Census Bureau. 136MB
Dataset Splits | Yes | Note that both RMSE and 0-1 Loss are assessed using 10-fold cross-validation. This should not be confused with the leave-one-out cross-validation used to select the number of attributes and parents in SKDB. For each fold of the 10-fold cross-validation, SKDB performs leave-one-out cross-validation on the training set to select the parameters of the model that is then tested on the holdout test fold.
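The nested evaluation protocol quoted above can be sketched as follows. This is an illustrative Python outline, not the authors' C++ software: `select_on_train` stands in for SKDB's inner leave-one-out model selection, and `evaluate` for scoring on the held-out fold; both names are hypothetical.

```python
import random

def ten_folds(n, seed=0):
    """Shuffle indices 0..n-1 and deal them into 10 roughly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::10] for i in range(10)]

def nested_cv(data, select_on_train, evaluate):
    """Outer 10-fold CV for reporting. Model selection (the paper uses
    leave-one-out CV here) sees only the training portion of each fold;
    the chosen model is then scored once on the held-out test fold."""
    folds = ten_folds(len(data))
    scores = []
    for k, test_fold in enumerate(folds):
        train = [data[i] for j, f in enumerate(folds) if j != k for i in f]
        model = select_on_train(train)          # inner LOO-CV happens here
        test = [data[i] for i in test_fold]
        scores.append(evaluate(model, test))
    return scores
```

The key point the quote makes is that the inner selection never touches the outer test fold, so the reported 10-fold estimates are not biased by hyperparameter tuning.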
Hardware Specification | Yes | Only the 11 smallest out of the 16 datasets have been considered for time measurement, since these experiments have been conducted on a desktop computer with an Intel(R) Core(TM) i5-2400 CPU @ 3.10 GHz, 3101 MHz, 64-bit, and 7866 MiB of memory, whereas the remaining experiments have been conducted in a heterogeneous grid environment for which CPU times are not commensurable.
Software Dependencies | No | All the experiments for the out-of-core algorithms have been carried out in C++ software specially designed to deal with out-of-core classification methods. ... We use LRSGD's implementation in Vowpal Wabbit (VW) (Agarwal et al., 2014), an open source out-of-core linear learning system. ... Weka's implementation has been considered.
Experiment Setup | Yes | For the Bayesian network classifiers we discretize numeric attributes using 5-bin equal frequency discretization. ... We use SKDB with RMSE as the objective function for the third pass that selects between structures ... Two combined techniques are considered for smoothing. In the first place, we use m-estimates (Mitchell, 1997) as follows: p(xi | πxi) = (counts(xi, πxi) + m/|Xi|) / (counts(πxi) + m) (4), where πxi are the parent-values of Xi and m = 1. ... We have conducted experiments with RF selecting 100 trees.
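The m-estimate smoothing in Equation (4) is a one-line computation; a minimal sketch, assuming m = 1 as in the paper (the function and argument names are illustrative, not from the authors' code):

```python
def m_estimate(count_child_parent, count_parent, cardinality, m=1.0):
    """Smoothed conditional probability per Eq. (4):
    p(x_i | pi_xi) = (counts(x_i, pi_xi) + m/|X_i|) / (counts(pi_xi) + m),
    where `cardinality` is |X_i|, the number of values attribute X_i can take."""
    return (count_child_parent + m / cardinality) / (count_parent + m)
```

With m = 1 an unseen parent configuration (both counts zero) yields the uniform prior 1/|Xi| rather than a zero probability, which is the point of the smoothing.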