RaSE: Random Subspace Ensemble Classification

Authors: Ye Tian, Yang Feng

JMLR 2021

Reproducibility Variable Result LLM Response
Research Type Experimental An array of simulations under various models and real-data applications demonstrate the effectiveness and robustness of the Ra SE classifier and its iterative version in terms of low misclassification rate and accurate feature ranking.
Researcher Affiliation Academia Ye Tian, Department of Statistics, Columbia University, New York, NY 10027, USA; Yang Feng, Department of Biostatistics, School of Global Public Health, New York University, New York, NY 10003, USA
Pseudocode Yes Algorithm 1: Random subspace ensemble classification (RaSE); Algorithm 2: Iterative RaSE (RaSE_T)
Open Source Code Yes The RaSE algorithm is implemented in the R package RaSEn on CRAN. 6.1 Summary In this work, we introduce a flexible ensemble classification framework named RaSE, which is designed to solve the sparse classification problem. To select the optimal subspace for each weak learner, we define a new information criterion, the ratio information criterion (RIC), based on Kullback-Leibler divergence, and it is shown to achieve screening consistency and weak consistency under some general model conditions. This guarantees that each weak learner is trained using features in the minimal discriminative set with high probability for a sufficiently large sample size and number of random subspaces. We also investigate the consistency of RIC for LDA and QDA models under some specific conditions. The theoretical analysis of RaSE classifiers with specific base classifiers is conducted. In addition, we present two versions of RaSE algorithms, that is, the vanilla version (RaSE) and the iterative version (RaSE_T). We also apply RaSE for feature ranking based on the average selected percentage of features in subspaces. Theoretical analysis shows that when the stepwise detectable condition holds and the signal is sufficiently sparse, the iterative RaSE can cover the minimal discriminative set with high probability after a few iterations, with the required number of random subspaces smaller than that for the vanilla RaSE. Multiple numerical experiments, including simulations and real data, verify that RaSE is a favorable classification method for sparse classification problems. The RaSE algorithms are available in the R package RaSEn (https://cran.r-project.org/web/packages/RaSEn/index.html).
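As a rough illustration of the ensemble scheme described in Algorithm 1, the sketch below implements the random-subspace idea in Python. This is not the authors' RaSEn implementation: the nearest-centroid base learner and the training-error criterion are stand-ins for the paper's base classifiers and the RIC, and all function names are ours.

```python
import numpy as np

def rase_fit_predict(X_tr, y_tr, X_te, B1=20, B2=50, D=None, rng=None):
    """Sketch of vanilla RaSE: for each of B1 weak learners, draw B2
    random subspaces, keep the one minimizing a criterion (here,
    training error as a stand-in for RIC), then aggregate votes."""
    rng = np.random.default_rng(rng)
    n, p = X_tr.shape
    D = D or min(p, int(np.sqrt(n)))  # default subspace-size upper bound
    votes = np.zeros(len(X_te))
    for _ in range(B1):
        best_err, best_S = np.inf, None
        for _ in range(B2):
            d = rng.integers(1, D + 1)                # d ~ Uniform{1,...,D}
            S = rng.choice(p, size=d, replace=False)  # uniform d-subset
            err = centroid_error(X_tr[:, S], y_tr)
            if err < best_err:
                best_err, best_S = err, S
        votes += centroid_predict(X_tr[:, best_S], y_tr, X_te[:, best_S])
    return (votes / B1 > 0.5).astype(int)  # fixed threshold alpha = 0.5

def centroid_predict(Xtr, ytr, X):
    """Toy base learner: classify to the nearer class centroid."""
    m0 = Xtr[ytr == 0].mean(axis=0)
    m1 = Xtr[ytr == 1].mean(axis=0)
    d0 = ((X - m0) ** 2).sum(axis=1)
    d1 = ((X - m1) ** 2).sum(axis=1)
    return (d1 < d0).astype(int)

def centroid_error(Xtr, ytr):
    return (centroid_predict(Xtr, ytr, Xtr) != ytr).mean()
```

In the actual package, the subspace criterion, the decision threshold, and the base classifiers are all configurable; this sketch only shows the two-level structure (B1 learners, B2 candidate subspaces each) the report refers to.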
Open Datasets Yes The madelon data set (http://archive.ics.uci.edu/ml/datasets/madelon) is an artificial data set containing data points grouped in 32 clusters placed on the vertices of a five-dimensional hypercube and randomly labeled as 0 or 1 (Guyon et al., 2005). The musk data set (https://archive.ics.uci.edu/ml/datasets/Musk+(Version+2)) contains 6598 observations with 5581 non-musk (class 0) and 1017 musk (class 1) molecules. The mice protein expression data set (https://archive.ics.uci.edu/ml/datasets/Mice+Protein+Expression) contains 1080 instances with 570 healthy mice (class 0) and 510 mice with Down syndrome (class 1). There are 77 features representing the expression of 77 different proteins (Higuera et al., 2015). The hand-written digits recognition data set (https://archive.ics.uci.edu/ml/datasets/Multiple+Features) consists of features of hand-written numerals (0-9) extracted from a collection of Dutch utility maps (Dua and Graff, 2019).
Dataset Splits Yes Here p = 400, and the training sample size n ∈ {200, 400, 1000}. Test data of size 1000 is independently generated from the same model. The training sample size is set as n ∈ {200, 500, 1000} for three different settings, and each time the remaining data is used as test data. The training sample size is set to be 200, 500, 1000 for each setting, and the remaining observations are used as the test data. Training samples of size 200, 500, 800 are considered, and the remaining observations are set as the test data. There are 400 observations, 200 of which belong to class 0, and the remaining 200 belong to class 1. Training samples of size 50, 100, 200 are used, and the remaining data is used as the test data.
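The recurring "train on n observations, test on the rest" scheme described above can be sketched with a small helper (illustrative only; the paper does not specify the exact splitting code):

```python
import numpy as np

def split_train_rest(n_total, n_train, seed=0):
    """Randomly designate n_train observations as the training set and
    use all remaining observations as the test set."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_total)
    return idx[:n_train], idx[n_train:]
```

For example, with the mice protein data (1080 instances) and a training size of 500, the remaining 580 observations form the test set.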
Hardware Specification No The paper does not provide specific hardware details such as GPU/CPU models, processors, or memory used for running the experiments. It only discusses computational complexity.
Software Dependencies No The standard LDA and QDA methods are implemented using the R package MASS. The kNN classifier is implemented by knn and knn.cv in the R package class and the function knn3 in the R package caret. We utilize the package dsda to fit the sLDA model. RAMP is implemented through the package RAMP. For the RF, we use the R package randomForest; the number of trees is set as 500 (the default) and ⌊√p⌋ variables are randomly selected when training each tree (the default). The NSC model is fitted by calling the function pamr.train in the package pamr. RPEnsemble is implemented by the R package RPEnsemble. To obtain the MLE of the parameters in the Gamma distribution, Newton's method is applied via the function nlm in the package stats, with the initial point chosen as the moment estimator. To get the non-parametric estimate of the KL divergence and RIC, we call the function KL.divergence in the package FNN. The paper lists several R packages used but does not specify their version numbers.
Experiment Setup Yes In our implementation, we set B1 = 200 and B2 = 500 as defaults. For the LDA and QDA base classifiers, the criterion C is set to choose the optimal subspace by minimizing the RIC, while for kNN, the default setting is minimizing the leave-one-out cross-validation error. Without prior information about the features, as mentioned in Section 2.1, the subspace distribution D is set as the hierarchical uniform distribution over the subspaces. To generate the size d of subspaces from the uniform distribution over {1, . . . , D}, the upper bound D has to be determined. In practice, for the QDA base classifier we set D = min(p, ⌊√n0⌋, ⌊√n1⌋), and for LDA, kNN, and all other base classifiers, we set D = min(p, ⌊√n⌋), where ⌊a⌋ denotes the largest integer not larger than a. The threshold α̂ is chosen by (2) to minimize the training error. When using the non-parametric estimate of RIC corresponding to (6), following Wang et al. (2009) and Ganguly et al. (2018), we set k0 = ⌊√n0⌋ and k1 = ⌊√n1⌋ to satisfy the conditions they presented for proving consistency. For the RF, we use the R package randomForest; the number of trees is set as 500 (the default) and ⌊√p⌋ variables are randomly selected when training each tree (the default). When fitting the standard kNN classifier, and the kNN base classifier in the RPEnsemble and RaSE methods, the number of neighbors k is chosen from {3, 5, 7, 9, 11} via leave-one-out cross-validation, following Cannings and Samworth (2017).
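The default subspace-size settings above (D = min(p, ⌊√n0⌋, ⌊√n1⌋) for QDA; D = min(p, ⌊√n⌋) otherwise; then d ~ Uniform{1, ..., D} followed by a uniform d-subset of features) can be made concrete in a short sketch. The helper names here are ours, not the RaSEn package's API:

```python
import math
import random

def default_D(n0, n1, p, base="lda"):
    """Default upper bound D on subspace sizes, per the setup above:
    QDA uses min(p, floor(sqrt(n0)), floor(sqrt(n1))); other base
    classifiers use min(p, floor(sqrt(n))) with n = n0 + n1."""
    if base == "qda":
        return min(p, math.isqrt(n0), math.isqrt(n1))
    return min(p, math.isqrt(n0 + n1))

def sample_subspace(p, D, rng):
    """Hierarchical uniform draw: first a size d ~ Uniform{1, ..., D},
    then a uniform random d-subset of the p features."""
    d = rng.randrange(1, D + 1)
    return sorted(rng.sample(range(p), d))
```

For instance, with class sample sizes n0 = 100 and n1 = 81 and p = 400, the QDA default gives D = min(400, 10, 9) = 9.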