Scalable Learning of Bayesian Network Classifiers

Authors: Ana M. Martínez, Geoffrey I. Webb, Shenglei Chen, Nayyar A. Zaidi

JMLR 2016

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our extensive experimental evaluation on 16 large data sets reveals that this out-of-core algorithm achieves competitive classification performance, and substantially better training and classification time than state-of-the-art in-core learners such as random forest and linear and non-linear logistic regression.
Researcher Affiliation | Academia | Ana M. Martínez EMAIL, Geoffrey I. Webb EMAIL, Faculty of Information Technology, Monash University, VIC 3800, Australia ... Shenglei Chen EMAIL, College of Information Science/Faculty of Information Technology, Nanjing Audit University/Monash University, China/Australia ... Nayyar A. Zaidi EMAIL, Faculty of Information Technology, Monash University, VIC 3800, Australia
Pseudocode | Yes | Algorithm 1: The KDB algorithm ... Algorithm 2: learnStructure(T) ... Algorithm 3: learnParameters(T, G) ... Algorithm 4: The SKDB algorithm
Open Source Code | Yes | A minimum functional part of the software containing SKDB can be found on the (first two) authors' academic web pages, for example http://www.csse.monash.edu.au/~webb/.
Open Datasets | Yes | We undertook an extensive online search to gather a group of large datasets, all of which have more than 100K instances. These datasets are described in Table 7 in Appendix A ... 2 census-income 0.299 41 2 Y (50%) (4.9 14.6%) UCI (Bache and Lichman, 2013; Oza and Russell, 2001). Weighted census data extracted from the 1994 and 1995 Current Population Surveys conducted by the U.S. Census Bureau. 136MB
Dataset Splits | Yes | Note that both RMSE and 0-1 Loss are assessed using 10-fold cross-validation. This should not be confused with the leave-one-out cross-validation used to select the number of attributes and parents in SKDB. For each fold of the 10-fold cross-validation, SKDB performs leave-one-out cross-validation on the training set to select the parameters of the model that is then tested on the holdout test fold.
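The nested evaluation protocol quoted above can be sketched as follows. This is an illustrative Python outline, not the authors' C++ software: `select_on_train` stands in for SKDB's inner leave-one-out model selection, and `evaluate` for scoring on the held-out fold; both names are hypothetical.

```python
import random

def ten_folds(n, seed=0):
    """Shuffle indices 0..n-1 and deal them into 10 roughly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::10] for i in range(10)]

def nested_cv(data, select_on_train, evaluate):
    """Outer 10-fold CV for reporting. Model selection (the paper uses
    leave-one-out CV here) sees only the training portion of each fold;
    the chosen model is then scored once on the held-out test fold."""
    folds = ten_folds(len(data))
    scores = []
    for k, test_fold in enumerate(folds):
        train = [data[i] for j, f in enumerate(folds) if j != k for i in f]
        model = select_on_train(train)          # inner LOO-CV happens here
        test = [data[i] for i in test_fold]
        scores.append(evaluate(model, test))
    return scores
```

The key point the quote makes is that the inner selection never touches the outer test fold, so the reported 10-fold estimates are not biased by hyperparameter tuning.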
Hardware Specification | Yes | Only the 11 smallest out of the 16 datasets have been considered for time measurement, since these experiments have been conducted on a desktop computer with an Intel(R) Core(TM) i5-2400 CPU @ 3.10 GHz, 3101 MHz, 64-bit, and 7866 MiB of memory, whereas the remaining experiments have been conducted in a heterogeneous grid environment for which CPU times are not commensurable.
Software Dependencies | No | All the experiments for the out-of-core algorithms have been carried out in C++ software specially designed to deal with out-of-core classification methods. ... We use LRSGD's implementation in Vowpal Wabbit (VW) (Agarwal et al., 2014), an open source out-of-core linear learning system. ... Weka's implementation has been considered.
Experiment Setup | Yes | For the Bayesian network classifiers we discretize numeric attributes using 5-bin equal frequency discretization. ... We use SKDB with RMSE as the objective function for the third pass that selects between structures ... Two combined techniques are considered for smoothing. In the first place, we use m-estimates (Mitchell, 1997) as follows: p(xi | πxi) = (counts(xi, πxi) + m/|Xi|) / (counts(πxi) + m) (4), where πxi are the parent-values of Xi and m = 1. ... We have conducted experiments with RF selecting 100 trees.
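The m-estimate smoothing in Equation (4) is a one-line computation; a minimal sketch, assuming m = 1 as in the paper (the function and argument names are illustrative, not from the authors' code):

```python
def m_estimate(count_child_parent, count_parent, cardinality, m=1.0):
    """Smoothed conditional probability per Eq. (4):
    p(x_i | pi_xi) = (counts(x_i, pi_xi) + m/|X_i|) / (counts(pi_xi) + m),
    where `cardinality` is |X_i|, the number of values attribute X_i can take."""
    return (count_child_parent + m / cardinality) / (count_parent + m)
```

With m = 1 an unseen parent configuration (both counts zero) yields the uniform prior 1/|Xi| rather than a zero probability, which is the point of the smoothing.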