Absent Data Generating Classifier for Imbalanced Class Sizes

Authors: Arash Pourhabib, Bani K. Mallick, Yu Ding

JMLR 2015

Each entry below gives the reproducibility variable, its assessed result, and the supporting LLM response.
Research Type: Experimental. "Implementing the proposed method on a number of simulated and real data sets, we show that our proposed method performs competitively compared to a set of alternative state-of-the-art imbalanced classification algorithms."
Researcher Affiliation: Academia. Arash Pourhabib (EMAIL), School of Industrial Engineering and Management, Oklahoma State University, 322 Engineering North, Stillwater, Oklahoma 74078-5016, USA; Bani K. Mallick (EMAIL), Department of Statistics, Texas A&M University, 3143 TAMU, College Station, TX 77843-3143, USA; Yu Ding (EMAIL), Department of Industrial and Systems Engineering, Texas A&M University, 3131 TAMU, College Station, TX 77843-3131, USA.
Pseudocode: Yes. Algorithm 1, "Absent Data Generator for Imbalanced Classification": "Given X^- and X^+, evaluate K, M, N, K_i, and M_i for i ∈ {-, +}, and let X̃^+ = X^+. repeat ..."
Open Source Code: No. "We code ADG, SMOTE, BSMOTE, Under+ENS, and Prob-Fit in MATLAB, and also use the SVM implementation in MATLAB."
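Since the MATLAB implementations are not released, here is a minimal sketch of what one of the re-implemented baselines (SMOTE-style minority oversampling by interpolating between a minority point and one of its nearest minority neighbors) does. The function name, parameters, and NumPy implementation are my own, not the authors' code:

```python
import numpy as np

def smote_sketch(X_min, n_synthetic, k=5, rng=None):
    """Generate synthetic minority samples by SMOTE-style interpolation.

    X_min: (n, d) array of minority-class points.
    Hypothetical helper, not the paper's MATLAB implementation.
    """
    rng = np.random.default_rng(rng)
    n = len(X_min)
    # Pairwise squared distances among minority points (diagonal excluded)
    # to find each point's k nearest minority neighbours.
    d2 = ((X_min[:, None, :] - X_min[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)
    k = min(k, n - 1)
    nn = np.argsort(d2, axis=1)[:, :k]
    out = np.empty((n_synthetic, X_min.shape[1]))
    for t in range(n_synthetic):
        i = rng.integers(n)            # random minority point
        j = nn[i, rng.integers(k)]     # one of its k nearest neighbours
        u = rng.random()               # interpolation weight in [0, 1]
        out[t] = X_min[i] + u * (X_min[j] - X_min[i])
    return out
```

Each synthetic point lies on the segment between a minority point and a neighbor, so the new samples stay inside the minority class's local convex regions.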
Open Datasets: Yes. "Four of them are from the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/): the Wisconsin Diagnostic Breast Cancer data set, the Ionosphere data set, the Yeast data set, and the Speech Recognition data set. The other seven are used in Wallace and Dahabreh (2012) (http://www.cebm.brown.edu/static/imbalanced-datasets.zip)."
Dataset Splits: Yes. "For a given imbalance ratio, we first randomly undersample both the majority and the minority data points so that the training data set is constructed with the specified degree of imbalance... We repeat this procedure ten times and report the average values as the estimated false alarm rate and detection power."
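The quoted protocol (undersample both classes to a target imbalance ratio, then evaluate false alarm rate and detection power, averaged over ten repetitions) can be sketched as follows. All names and the helper structure are assumptions, not the authors' code:

```python
import numpy as np

def make_imbalanced_split(y, ratio, n_minority, rng):
    """Randomly undersample both classes so the training set holds
    n_minority positives and ratio * n_minority negatives; everything
    else goes to the test set (sketch of the quoted protocol)."""
    pos = rng.permutation(np.flatnonzero(y == 1))[:n_minority]
    neg = rng.permutation(np.flatnonzero(y == 0))[:int(ratio * n_minority)]
    train = np.concatenate([pos, neg])
    test = np.setdiff1d(np.arange(len(y)), train)
    return train, test

def false_alarm_and_power(y_true, y_pred):
    # False alarm rate: fraction of true negatives predicted positive.
    # Detection power: fraction of true positives predicted positive.
    far = np.mean(y_pred[y_true == 0] == 1)
    power = np.mean(y_pred[y_true == 1] == 1)
    return far, power
```

In the paper's setting one would draw ten such splits with fresh random seeds, fit the classifier on each training set, and average the two metrics over the repetitions.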
Hardware Specification: No. The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running the experiments.
Software Dependencies: Yes. "To implement KFD we use the MATLAB package Statistical Pattern Recognition Tool (STPRtool) (Franc, 2011)."
Experiment Setup: Yes. "Based on our experiments, ADG is not very sensitive to the number of absent points k, so it can simply be set to a number between 10 and 15. The number of actual minority data points generated, q, on the other hand, is chosen so that the final data set of interest is relatively balanced. In CS-SVM, we choose the value of the so-called box constraint in SVM to be l_-/(2l) for positive samples and l_+/(2l) for negative samples, so that the cost ratio for the two-class misclassification is l_-/l_+."
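The CS-SVM setup above weights the box constraint per class so that misclassifying a minority (positive) point costs l_-/l_+ times as much as misclassifying a majority point. A minimal sketch of the same idea using scikit-learn's SVC instead of the paper's MATLAB SVM (SVC's class_weight multiplies C per class; the data and weights here are illustrative assumptions):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Toy imbalanced data: 20 minority (+1) vs 180 majority (-1) points.
X_pos = rng.normal(1.0, 1.0, size=(20, 2))
X_neg = rng.normal(-1.0, 1.0, size=(180, 2))
X = np.vstack([X_pos, X_neg])
y = np.concatenate([np.ones(20, dtype=int), -np.ones(180, dtype=int)])

l_pos, l_neg = 20, 180
l = l_pos + l_neg
# Per-class multipliers l_-/(2l) and l_+/(2l): their ratio is l_-/l_+ = 9,
# so minority errors are penalised nine times more heavily.
class_weight = {1: l_neg / (2 * l), -1: l_pos / (2 * l)}
clf = SVC(kernel="rbf", C=1.0, class_weight=class_weight).fit(X, y)
```

Scaling both weights by a common constant only rescales the overall C, so it is the ratio l_-/l_+ that matters, as the quoted passage states.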