Absent Data Generating Classifier for Imbalanced Class Sizes

Authors: Arash Pourhabib, Bani K. Mallick, Yu Ding

JMLR 2015

Each entry below gives the reproducibility variable, its assessed result, and the supporting LLM response.
Research Type: Experimental. "Implementing the proposed method on a number of simulated and real data sets, we show that our proposed method performs competitively compared to a set of alternative state-of-the-art imbalanced classification algorithms."
Researcher Affiliation: Academia. Arash Pourhabib (EMAIL), School of Industrial Engineering and Management, Oklahoma State University, 322 Engineering North, Stillwater, Oklahoma 74078-5016, USA; Bani K. Mallick (EMAIL), Department of Statistics, Texas A&M University, 3143 TAMU, College Station, TX 77843-3143, USA; Yu Ding (EMAIL), Department of Industrial and Systems Engineering, Texas A&M University, 3131 TAMU, College Station, TX 77843-3131, USA.
Pseudocode: Yes. Algorithm 1, "Absent Data Generator for Imbalanced Classification": "Given X^- and X^+, evaluate K, M, N, K_i, and M_i for i ∈ {-, +}, and let X̃^+ = X^+. repeat ..."
Open Source Code: No. "We code ADG, SMOTE, BSMOTE, Under+ENS, and Prob-Fit in MATLAB, and also use the SVM implementation in MATLAB."
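Since the MATLAB implementations are not released, here is a minimal sketch of what one of the re-implemented baselines (SMOTE-style minority oversampling by interpolating between a minority point and one of its nearest minority neighbors) does. The function name, parameters, and NumPy implementation are my own, not the authors' code:

```python
import numpy as np

def smote_sketch(X_min, n_synthetic, k=5, rng=None):
    """Generate synthetic minority samples by SMOTE-style interpolation.

    X_min: (n, d) array of minority-class points.
    Hypothetical helper, not the paper's MATLAB implementation.
    """
    rng = np.random.default_rng(rng)
    n = len(X_min)
    # Pairwise squared distances among minority points (diagonal excluded)
    # to find each point's k nearest minority neighbours.
    d2 = ((X_min[:, None, :] - X_min[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)
    k = min(k, n - 1)
    nn = np.argsort(d2, axis=1)[:, :k]
    out = np.empty((n_synthetic, X_min.shape[1]))
    for t in range(n_synthetic):
        i = rng.integers(n)            # random minority point
        j = nn[i, rng.integers(k)]     # one of its k nearest neighbours
        u = rng.random()               # interpolation weight in [0, 1]
        out[t] = X_min[i] + u * (X_min[j] - X_min[i])
    return out
```

Each synthetic point lies on the segment between a minority point and a neighbor, so the new samples stay inside the minority class's local convex regions.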
Open Datasets: Yes. "Four of them are from the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/): the Wisconsin Diagnostic Breast Cancer data set, the Ionosphere data set, the Yeast data set, and the Speech Recognition data set. The other seven are used in Wallace and Dahabreh (2012) (http://www.cebm.brown.edu/static/imbalanced-datasets.zip)."
Dataset Splits: Yes. "For a given imbalance ratio, we first randomly undersample both the majority and the minority data points so that the training data set is constructed with the specified degree of imbalance... We repeat this procedure ten times and report the average values as the estimated false alarm rate and detection power."
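The quoted protocol (undersample both classes to a target imbalance ratio, then evaluate false alarm rate and detection power, averaged over ten repetitions) can be sketched as follows. All names and the helper structure are assumptions, not the authors' code:

```python
import numpy as np

def make_imbalanced_split(y, ratio, n_minority, rng):
    """Randomly undersample both classes so the training set holds
    n_minority positives and ratio * n_minority negatives; everything
    else goes to the test set (sketch of the quoted protocol)."""
    pos = rng.permutation(np.flatnonzero(y == 1))[:n_minority]
    neg = rng.permutation(np.flatnonzero(y == 0))[:int(ratio * n_minority)]
    train = np.concatenate([pos, neg])
    test = np.setdiff1d(np.arange(len(y)), train)
    return train, test

def false_alarm_and_power(y_true, y_pred):
    # False alarm rate: fraction of true negatives predicted positive.
    # Detection power: fraction of true positives predicted positive.
    far = np.mean(y_pred[y_true == 0] == 1)
    power = np.mean(y_pred[y_true == 1] == 1)
    return far, power
```

In the paper's setting one would draw ten such splits with fresh random seeds, fit the classifier on each training set, and average the two metrics over the repetitions.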
Hardware Specification: No. The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running the experiments.
Software Dependencies: Yes. "To implement KFD we use the MATLAB package Statistical Pattern Recognition Tool (STPRtool) (Franc, 2011)."
Experiment Setup: Yes. "Based on our experiments, ADG is not very sensitive to the number of absent points k, so it can simply be set to a number between 10 and 15. The number of actual minority data points generated, q, on the other hand, is chosen so that the final data set of interest is relatively balanced. In CS-SVM, we choose the value of the so-called box constraint in SVM to be l_-/(2l) for positive samples and l_+/(2l) for negative samples, so that the cost ratio for the two-class misclassification is l_-/l_+."
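The CS-SVM setup above weights the box constraint per class so that misclassifying a minority (positive) point costs l_-/l_+ times as much as misclassifying a majority point. A minimal sketch of the same idea using scikit-learn's SVC instead of the paper's MATLAB SVM (SVC's class_weight multiplies C per class; the data and weights here are illustrative assumptions):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Toy imbalanced data: 20 minority (+1) vs 180 majority (-1) points.
X_pos = rng.normal(1.0, 1.0, size=(20, 2))
X_neg = rng.normal(-1.0, 1.0, size=(180, 2))
X = np.vstack([X_pos, X_neg])
y = np.concatenate([np.ones(20, dtype=int), -np.ones(180, dtype=int)])

l_pos, l_neg = 20, 180
l = l_pos + l_neg
# Per-class multipliers l_-/(2l) and l_+/(2l): their ratio is l_-/l_+ = 9,
# so minority errors are penalised nine times more heavily.
class_weight = {1: l_neg / (2 * l), -1: l_pos / (2 * l)}
clf = SVC(kernel="rbf", C=1.0, class_weight=class_weight).fit(X, y)
```

Scaling both weights by a common constant only rescales the overall C, so it is the ratio l_-/l_+ that matters, as the quoted passage states.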