A flexible model-free prediction-based framework for feature ranking

Authors: Jingyi Jessica Li, Yiling Elaine Chen, Xin Tong

JMLR 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate the use and relative advantages of CC and NPC in simulation and real data studies. In Section 5, we use simulation studies to demonstrate the performance of sample-level CC and NPC in ranking low-dimensional and high-dimensional features. In Section 6, we apply sample-level CC and NPC to marginal feature ranking in two real datasets.
Researcher Affiliation | Academia | Jingyi Jessica Li (EMAIL), Department of Statistics, University of California, Los Angeles; Yiling Elaine Chen (EMAIL), Department of Statistics, University of California, Los Angeles; Xin Tong (EMAIL), Department of Data Sciences and Operations, Marshall School of Business, University of Southern California.
Pseudocode | Yes | To estimate the oracle threshold C*_α, we use the NP umbrella algorithm (Tong et al., 2018). Third, for a user-specified type I error upper bound α ∈ (0, 1) and a violation rate δ1 ∈ (0, 1), which refers to the probability that the type I error of the trained classifier exceeds α, the algorithm chooses the order k* = min{ k ∈ {1, ..., m2} : Σ_{j=k}^{m2} C(m2, j) (1 − α)^j α^(m2 − j) ≤ δ1 }. Figure H.2: An illustration of the calculation of s-CC and s-NPC.
Open Source Code | Yes | The code for reproducing the numerical results is available at http://doi.org/10.5281/zenodo.4680067. The R package frc is available at https://github.com/JSB-UCLA/frc.
Open Datasets | Yes | We download the preprocessed and normalized dataset from the Gene Expression Omnibus (GEO) (Edgar et al., 2002) with the accession number GSE60185. The second dataset contains microRNA (miRNA) expression levels in urine samples of prostate cancer patients, downloaded from the GEO with accession number GSE86474 (Jeon et al., 2019).
Dataset Splits | Yes | The construction of both s-CC and s-NPC involves splitting the class 0 and class 1 observations. To increase stability, we perform multiple random splits. In detail, we randomly divide S0 B times into two halves S0(b)_ts = {X0(b)_1, ..., X0(b)_m1} and S0(b)_lo = {X0(b)_{m1+1}, ..., X0(b)_{m1+m2}}, where m1 + m2 = m, the subscripts ts and lo stand for train-scoring and left-out respectively, and the superscript b ∈ {1, ..., B} indicates the b-th random split. We also randomly split S1 B times into S1(b)_ts = {X1(b)_1, ..., X1(b)_n1} and S1(b)_lo = {X1(b)_{n1+1}, ..., X1(b)_{n1+n2}}, where n1 + n2 = n and b ∈ {1, ..., B}. In this work, we make equal-sized splits: m1 = m/2 and n1 = n/2.
Hardware Specification | No | No specific hardware details were provided. The paper generally refers to 'computational resource', but no models, processors, or memory specifications are given for the experimental setup.
Software Dependencies | No | For the implementation of s-NPC and s-CC, we use the kde() function with default arguments in the R package ks. Regarding the RF algorithm, we use the randomForest() function in the R package randomForest. We implement SVM using the R function svm() in the e1071 package.
Experiment Setup | Yes | In all the simulation studies, we set the number of random splits B = 11... The number of trees is set to ntree=500 by default. We apply s-CC (Equation 8) and s-NPC (Equation 11) with δ1 = .05 and four α levels: .05, .10, .20, and .30... Here we set the number of random splits B = 1000 for s-CC and s-NPC, as allowed by our computational resource.
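The order-statistic rule quoted in the Pseudocode row can be made concrete. The paper's implementation is in R; the following is a minimal Python sketch under the stated formula from the NP umbrella algorithm (Tong et al., 2018), where the threshold is the k*-th smallest of the m2 left-out class-0 scores. The function name `min_order` is ours, not from the paper's code.

```python
from math import comb

def min_order(m2, alpha, delta1):
    """Smallest 1-based order k of the m2 left-out class-0 scores such that
    thresholding at the k-th order statistic keeps the probability of the
    type I error exceeding alpha at or below delta1. Returns None if no
    order satisfies the bound (m2 is too small for the requested alpha)."""
    for k in range(1, m2 + 1):
        # Violation probability: upper binomial tail sum_{j=k}^{m2}
        # C(m2, j) (1 - alpha)^j alpha^(m2 - j); decreases as k grows.
        violation = sum(
            comb(m2, j) * (1 - alpha) ** j * alpha ** (m2 - j)
            for j in range(k, m2 + 1)
        )
        if violation <= delta1:
            return k
    return None
```

For example, with α = δ1 = .05 the bound is infeasible for small m2 (even the largest order statistic violates it), which matches the known minimum sample-size requirement of the umbrella algorithm.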
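The repeated equal-sized splitting described in the Dataset Splits row (B random divisions of one class's observations into a train-scoring half and a left-out half) can be sketched as follows. This is an illustrative Python version, not the paper's R code; the function name `random_splits` and the seed handling are ours.

```python
import numpy as np

def random_splits(obs, B, seed=0):
    """B random equal-sized splits of one class's observations `obs`
    (a 1-D array) into a train-scoring half ('ts') and a left-out
    half ('lo'), mirroring the s-CC / s-NPC construction."""
    rng = np.random.default_rng(seed)
    n = len(obs)
    half = n // 2  # equal-sized splits, e.g. m1 = m/2
    splits = []
    for _ in range(B):
        perm = rng.permutation(n)  # fresh random division each repetition
        splits.append((obs[perm[:half]], obs[perm[half:]]))
    return splits
```

Each of the B pairs partitions the class's observations, so statistics computed on the train-scoring half can be evaluated on the disjoint left-out half and then averaged over the B repetitions for stability.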