Classification of Imbalanced Data with a Geometric Digraph Family

Authors: Artür Manukyan, Elvan Ceyhan

JMLR 2016

Reproducibility Variable Result LLM Response
Research Type Experimental We assess the classification performance of CCCD classifiers by extensive Monte Carlo simulations, comparing them with other classifiers commonly used in the literature. ... Experiments on both simulated and real data sets indicate that CCCD classifiers are robust to the class imbalance problem.
Researcher Affiliation Academia Artur Manukyan EMAIL Graduate School of Sciences and Engineering Koc University Sarıyer, 34450, Istanbul, Turkey; Elvan Ceyhan EMAIL Department of Statistics University of Pittsburgh Pittsburgh, 15260, PA, USA. Both authors are affiliated with universities.
Pseudocode Yes Algorithm 1 The greedy algorithm for finding an approximate minimum dominating set of a digraph D. ... Algorithm 2 The greedy algorithm for finding an approximate minimum cardinality ball cover CX ... Algorithm 3 The greedy algorithm for finding an approximate minimum dominating set for RW-CCCDs of points Xn from the target class given non-target class points Ym.
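The greedy algorithms quoted above share one pattern: repeatedly add the vertex (covering ball) that dominates the most still-uncovered points. A minimal sketch of that generic greedy dominating-set heuristic, not the paper's exact Algorithms 1-3; the digraph encoding (vertex → set of vertices it covers, including itself) and the function name are illustrative assumptions:

```python
def greedy_dominating_set(cover):
    """Greedy approximation of a minimum dominating set of a digraph.

    cover: dict mapping each vertex to its closed out-neighbourhood,
    i.e. the set of vertices it dominates (including itself).
    """
    uncovered = set(cover)  # every vertex starts uncovered
    dominating = []
    while uncovered:
        # pick the vertex whose neighbourhood hits the most uncovered vertices
        best = max(cover, key=lambda v: len(cover[v] & uncovered))
        dominating.append(best)
        uncovered -= cover[best]
    return dominating

# Toy digraph: vertex 1 covers {1,2,3}, vertex 4 covers {4,5}.
digraph = {1: {1, 2, 3}, 2: {2}, 3: {3}, 4: {4, 5}, 5: {5}}
print(greedy_dominating_set(digraph))  # → [1, 4]
```

The greedy choice gives the standard ln(n)-factor approximation of the set-cover reduction; the paper's Algorithms 2-3 additionally account for ball radii and non-target penalties.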
Open Source Code Yes We employ the cccd, e1071 and RWeka packages in R to classify test data sets with the P-CCCD, SVM (with Gaussian kernel) and C4.5 classifiers, respectively (Marchette, 2013; Meyer et al., 2014; R Core Team, 2015). ... Marchette (2013). cccd: Class Cover Catch Digraphs, 2013. URL http://CRAN.R-project.org/package=cccd. R package version 1.04.
Open Datasets Yes Finally, we apply all these classifiers on several UCI and KEEL data sets. By using the SVDD method of Tax and Duin (2004), we estimated the overlapping ratios of all these data sets. ... UCI Machine Learning and KEEL repositories (Bache and Lichman, 2013; Alcalá-Fdez et al., 2011).
Dataset Splits Yes On each replication, we train the data with equal sizes of observations (m = n) from each class for n = 50, 100, 200, 500. On each replication, we record the AUC measures of the classifiers on the test data set with 100 observations from each class, resulting in a test data set of size 200. ... To test the difference between the AUC of classifiers, we employ the 5x2 cross validation (CV) paired t-test (see Dietterich, 1998) ... For each of five repetitions, we divide the data into two folds.
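The 5x2 CV paired t-test of Dietterich (1998) cited in this row is computed directly from the ten per-fold performance differences. A hedged sketch of that statistic; the function name and the toy AUC differences are illustrative, not from the paper:

```python
import math

def cv_5x2_paired_t(diffs):
    """Dietterich's 5x2 CV paired t statistic.

    diffs: five pairs (d_i1, d_i2) of performance differences (e.g. AUC)
    between two classifiers on the two folds of repetition i.
    """
    variances = []
    for d1, d2 in diffs:
        dbar = (d1 + d2) / 2.0
        variances.append((d1 - dbar) ** 2 + (d2 - dbar) ** 2)
    # Compared against a Student's t distribution with 5 degrees of freedom.
    return diffs[0][0] / math.sqrt(sum(variances) / 5.0)

# Toy differences: classifier A beats B by 0-0.02 AUC on every fold.
t = cv_5x2_paired_t([(0.02, 0.00)] * 5)
print(round(t, 4))  # → 1.4142
```

The statistic uses only the first fold's difference in the numerator but pools the fold-pair variances of all five repetitions, which is what controls the test's Type I error relative to a naive paired t-test.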
Hardware Specification No Most of the Monte Carlo simulations presented in this article were executed at Koc University High Performance Computing Laboratory. No specific hardware (e.g., CPU/GPU models, memory) is mentioned.
Software Dependencies Yes We employ the cccd, e1071 and RWeka packages in R to classify test data sets with the P-CCCD, SVM (with Gaussian kernel) and C4.5 classifiers, respectively (Marchette, 2013; Meyer et al., 2014; R Core Team, 2015). ... Marchette (2013). cccd: Class Cover Catch Digraphs, 2013. URL http://CRAN.R-project.org/package=cccd. R package version 1.04. ... D. Meyer, E. Dimitriadou, K. Hornik, A. Weingessel, and F. Leisch. e1071: Misc Functions of the Department of Statistics (e1071), TU Wien, 2014. URL http://CRAN.R-project.org/package=e1071. R package version 1.6-4.
Experiment Setup Yes For each of the four classification methods other than C4.5, we assign the optimum parameter values, which are the best performing values among all considered parameters. For example, the optimum P-CCCD parameter τ is found in a preliminary (pilot) Monte Carlo simulation study associated with the main simulation setting (i.e., the same setting of the main simulation). ... P-CCCD with the optimum τ (in the pilot study) among τ = 0.0, 0.1, ..., 1.0. RW-CCCD with the optimum e (in the pilot study) among e = 0, 0.1, ..., 1.0. k-NN with the optimum k (in the pilot study) among k = 1, 2, ..., 30. SVM with the radial basis function (Gaussian) kernel with the optimum γ (in the pilot study) among γ = 0.1, 0.2, ..., 3.9, 4.0. C45-LP: C4.5 with Laplace smoothing and reduced error pruning (25% confidence). C45-LNP: C4.5 with Laplace smoothing and no pruning. SMOTE+ENN: a combination of SMOTE (t = 2 and k = 5) and ENN (k = 3). Easy Ensemble: a combination of undersampling (T = 4) and AdaBoost (si = 10) for i = 1, 2, ..., T.
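SMOTE, listed in the setup above, oversamples the minority class by interpolating between a minority point and one of its k nearest minority-class neighbours (ENN then prunes noisy points, which is omitted here). A minimal, pure-Python sketch of the SMOTE interpolation step only; the function name and toy data are illustrative assumptions, not the paper's pipeline:

```python
import random

def smote_sample(minority, k=5, rng=random):
    """Generate one synthetic minority point, SMOTE-style."""
    x = rng.choice(minority)
    # k nearest minority neighbours of x (squared Euclidean distance)
    neighbours = sorted((p for p in minority if p is not x),
                        key=lambda p: sum((a - b) ** 2 for a, b in zip(p, x)))[:k]
    nn = rng.choice(neighbours)
    gap = rng.random()  # random position along the segment from x to nn
    return tuple(a + gap * (b - a) for a, b in zip(x, nn))

random.seed(0)
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
print(smote_sample(minority, k=2))  # a point on a segment between two minority points
```

Because each synthetic point is a convex combination of two existing minority points, SMOTE densifies the minority region rather than duplicating observations, which is why it is paired with ENN cleaning in the setup above.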