Classification of Imbalanced Data with a Geometric Digraph Family

Authors: Artür Manukyan, Elvan Ceyhan

JMLR 2016

Reproducibility Variable Result LLM Response
Research Type Experimental We assess the classification performance of CCCD classifiers by extensive Monte Carlo simulations, comparing them with other classifiers commonly used in the literature. ... Experiments on both simulated and real data sets indicate that CCCD classifiers are robust to the class imbalance problem.
Researcher Affiliation Academia Artur Manukyan EMAIL Graduate School of Sciences and Engineering Koc University Sarıyer, 34450, Istanbul, Turkey; Elvan Ceyhan EMAIL Department of Statistics University of Pittsburgh Pittsburgh, 15260, PA, USA. Both authors are affiliated with universities.
Pseudocode Yes Algorithm 1 The greedy algorithm for finding an approximate minimum dominating set of a digraph D. ... Algorithm 2 The greedy algorithm for finding an approximate minimum cardinality ball cover CX ... Algorithm 3 The greedy algorithm for finding an approximate minimum dominating set for RW-CCCDs of points Xn from the target class given non-target class points Ym.
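The greedy algorithms quoted above share one pattern: repeatedly add the vertex (covering ball) that dominates the most still-uncovered points. A minimal sketch of that generic greedy dominating-set heuristic, not the paper's exact Algorithms 1-3; the digraph encoding (vertex → set of vertices it covers, including itself) and the function name are illustrative assumptions:

```python
def greedy_dominating_set(cover):
    """Greedy approximation of a minimum dominating set of a digraph.

    cover: dict mapping each vertex to its closed out-neighbourhood,
    i.e. the set of vertices it dominates (including itself).
    """
    uncovered = set(cover)  # every vertex starts uncovered
    dominating = []
    while uncovered:
        # pick the vertex whose neighbourhood hits the most uncovered vertices
        best = max(cover, key=lambda v: len(cover[v] & uncovered))
        dominating.append(best)
        uncovered -= cover[best]
    return dominating

# Toy digraph: vertex 1 covers {1,2,3}, vertex 4 covers {4,5}.
digraph = {1: {1, 2, 3}, 2: {2}, 3: {3}, 4: {4, 5}, 5: {5}}
print(greedy_dominating_set(digraph))  # → [1, 4]
```

The greedy choice gives the standard ln(n)-factor approximation of the set-cover reduction; the paper's Algorithms 2-3 additionally account for ball radii and non-target penalties.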
Open Source Code Yes We employ the cccd, e1071 and RWeka packages in R to classify test data sets with the P-CCCD, SVM (with Gaussian kernel) and C4.5 classifiers, respectively (Marchette, 2013; Meyer et al., 2014; R Core Team, 2015). ... Marchette (2013). cccd: Class Cover Catch Digraphs, 2013. URL http://CRAN.R-project.org/package=cccd. R package version 1.04.
Open Datasets Yes Finally, we apply all these classifiers on several UCI and KEEL data sets. By using the SVDD method of Tax and Duin (2004), we estimated the overlapping ratios of all these data sets. ... UCI Machine Learning and KEEL repositories (Bache and Lichman, 2013; Alcalá-Fdez et al., 2011).
Dataset Splits Yes On each replication, we train the data with equal sizes of observations (m = n) from each class for n = 50, 100, 200, 500. On each replication, we record the AUC measures of the classifiers on the test data set with 100 observations from each class, resulting in a test data set of size 200. ... To test the difference between the AUC of classifiers, we employ the 5x2 cross validation (CV) paired t-test (see Dietterich, 1998) ... For each of five repetitions, we divide the data into two folds.
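The 5x2 CV paired t-test of Dietterich (1998) cited in this row is computed directly from the ten per-fold performance differences. A hedged sketch of that statistic; the function name and the toy AUC differences are illustrative, not from the paper:

```python
import math

def cv_5x2_paired_t(diffs):
    """Dietterich's 5x2 CV paired t statistic.

    diffs: five pairs (d_i1, d_i2) of performance differences (e.g. AUC)
    between two classifiers on the two folds of repetition i.
    """
    variances = []
    for d1, d2 in diffs:
        dbar = (d1 + d2) / 2.0
        variances.append((d1 - dbar) ** 2 + (d2 - dbar) ** 2)
    # Compared against a Student's t distribution with 5 degrees of freedom.
    return diffs[0][0] / math.sqrt(sum(variances) / 5.0)

# Toy differences: classifier A beats B by 0-0.02 AUC on every fold.
t = cv_5x2_paired_t([(0.02, 0.00)] * 5)
print(round(t, 4))  # → 1.4142
```

The statistic uses only the first fold's difference in the numerator but pools the fold-pair variances of all five repetitions, which is what controls the test's Type I error relative to a naive paired t-test.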
Hardware Specification No Most of the Monte Carlo simulations presented in this article were executed at Koc University High Performance Computing Laboratory. No specific hardware (e.g., CPU/GPU models, memory) is mentioned.
Software Dependencies Yes We employ the cccd, e1071 and RWeka packages in R to classify test data sets with the P-CCCD, SVM (with Gaussian kernel) and C4.5 classifiers, respectively (Marchette, 2013; Meyer et al., 2014; R Core Team, 2015). ... Marchette (2013). cccd: Class Cover Catch Digraphs, 2013. URL http://CRAN.R-project.org/package=cccd. R package version 1.04. ... D. Meyer, E. Dimitriadou, K. Hornik, A. Weingessel, and F. Leisch. e1071: Misc Functions of the Department of Statistics (e1071), TU Wien, 2014. URL http://CRAN.R-project.org/package=e1071. R package version 1.6-4.
Experiment Setup Yes For each of the four classification methods other than C4.5, we assign the optimum parameter values, which are the best performing values among all considered parameters. For example, the optimum P-CCCD parameter τ is found in a preliminary (pilot) Monte Carlo simulation study associated with the main simulation setting (i.e., the same setting of the main simulation). ... P-CCCD with the optimum τ (in the pilot study) among τ = 0.0, 0.1, ..., 1.0. RW-CCCD with the optimum e (in the pilot study) among e = 0, 0.1, ..., 1.0. k-NN with the optimum k (in the pilot study) among k = 1, 2, ..., 30. SVM with the radial basis function (Gaussian) kernel with the optimum γ (in the pilot study) among γ = 0.1, 0.2, ..., 3.9, 4.0. C45-LP: C4.5 with Laplace smoothing and reduced error pruning (25% confidence). C45-LNP: C4.5 with Laplace smoothing and no pruning. SMOTE+ENN: a combination of SMOTE (t = 2 and k = 5) and ENN (k = 3). Easy Ensemble: a combination of undersampling (T = 4) and AdaBoost (si = 10) for i = 1, 2, ..., T.
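SMOTE, listed in the setup above, oversamples the minority class by interpolating between a minority point and one of its k nearest minority-class neighbours (ENN then prunes noisy points, which is omitted here). A minimal, pure-Python sketch of the SMOTE interpolation step only; the function name and toy data are illustrative assumptions, not the paper's pipeline:

```python
import random

def smote_sample(minority, k=5, rng=random):
    """Generate one synthetic minority point, SMOTE-style."""
    x = rng.choice(minority)
    # k nearest minority neighbours of x (squared Euclidean distance)
    neighbours = sorted((p for p in minority if p is not x),
                        key=lambda p: sum((a - b) ** 2 for a, b in zip(p, x)))[:k]
    nn = rng.choice(neighbours)
    gap = rng.random()  # random position along the segment from x to nn
    return tuple(a + gap * (b - a) for a, b in zip(x, nn))

random.seed(0)
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
print(smote_sample(minority, k=2))  # a point on a segment between two minority points
```

Because each synthetic point is a convex combination of two existing minority points, SMOTE densifies the minority region rather than duplicating observations, which is why it is paired with ENN cleaning in the setup above.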