A flexible model-free prediction-based framework for feature ranking

Authors: Jingyi Jessica Li, Yiling Elaine Chen, Xin Tong

JMLR 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate the use and relative advantages of CC and NPC in simulation and real data studies. In Section 5, we use simulation studies to demonstrate the performance of sample-level CC and NPC in ranking low-dimensional and high-dimensional features. In Section 6, we apply sample-level CC and NPC to marginal feature ranking in two real datasets.
Researcher Affiliation | Academia | Jingyi Jessica Li (EMAIL), Department of Statistics, University of California, Los Angeles; Yiling Elaine Chen (EMAIL), Department of Statistics, University of California, Los Angeles; Xin Tong (EMAIL), Department of Data Sciences and Operations, Marshall School of Business, University of Southern California.
Pseudocode | Yes | To estimate the oracle threshold C*_α, we use the NP umbrella algorithm (Tong et al., 2018). Third, for a user-specified type I error upper bound α ∈ (0, 1) and a violation rate δ1 ∈ (0, 1), which refers to the probability that the type I error of the trained classifier exceeds α, the algorithm chooses the order k* = min{ k ∈ {1, ..., m2} : Σ_{j=k}^{m2} C(m2, j) (1 − α)^j α^(m2 − j) ≤ δ1 }. Figure H.2: An illustration of the calculation of s-CC and s-NPC.
Open Source Code | Yes | The code for reproducing the numerical results is available at http://doi.org/10.5281/zenodo.4680067. The R package frc is available at https://github.com/JSB-UCLA/frc.
Open Datasets | Yes | We download the preprocessed and normalized dataset from the Gene Expression Omnibus (GEO) (Edgar et al., 2002) with the accession number GSE60185. The second dataset contains microRNA (miRNA) expression levels in urine samples of prostate cancer patients, downloaded from the GEO with accession number GSE86474 (Jeon et al., 2019).
Dataset Splits | Yes | The construction of both s-CC and s-NPC involves splitting the class 0 and class 1 observations. To increase stability, we perform multiple random splits. In detail, we randomly divide S0 B times into two halves S0(b)_ts = {X0(b)_1, ..., X0(b)_m1} and S0(b)_lo = {X0(b)_{m1+1}, ..., X0(b)_{m1+m2}}, where m1 + m2 = m, the subscripts ts and lo stand for train-scoring and left-out respectively, and the superscript b ∈ {1, ..., B} indicates the b-th random split. We also randomly split S1 B times into S1(b)_ts = {X1(b)_1, ..., X1(b)_n1} and S1(b)_lo = {X1(b)_{n1+1}, ..., X1(b)_{n1+n2}}, where n1 + n2 = n and b ∈ {1, ..., B}. In this work, we make equal-sized splits: m1 = m/2 and n1 = n/2.
Hardware Specification | No | No specific hardware details were provided. The paper generally refers to 'computational resource', but no models, processors, or memory specifications are given for the experimental setup.
Software Dependencies | No | For the implementation of s-NPC and s-CC, we use the kde() function with default arguments in the R package ks. Regarding the RF algorithm, we use the randomForest() function in the R package randomForest. We implement SVM using the R function svm() in the e1071 package.
Experiment Setup | Yes | In all the simulation studies, we set the number of random splits B = 11... The number of trees is set to ntree=500 by default. We apply s-CC (Equation 8) and s-NPC (Equation 11) with δ1 = .05 and four α levels: .05, .10, .20, and .30... Here we set the number of random splits B = 1000 for s-CC and s-NPC, as allowed by our computational resource.
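The order-statistic rule quoted in the Pseudocode row can be made concrete. The paper's implementation is in R; the following is a minimal Python sketch under the stated formula from the NP umbrella algorithm (Tong et al., 2018), where the threshold is the k*-th smallest of the m2 left-out class-0 scores. The function name `min_order` is ours, not from the paper's code.

```python
from math import comb

def min_order(m2, alpha, delta1):
    """Smallest 1-based order k of the m2 left-out class-0 scores such that
    thresholding at the k-th order statistic keeps the probability of the
    type I error exceeding alpha at or below delta1. Returns None if no
    order satisfies the bound (m2 is too small for the requested alpha)."""
    for k in range(1, m2 + 1):
        # Violation probability: upper binomial tail sum_{j=k}^{m2}
        # C(m2, j) (1 - alpha)^j alpha^(m2 - j); decreases as k grows.
        violation = sum(
            comb(m2, j) * (1 - alpha) ** j * alpha ** (m2 - j)
            for j in range(k, m2 + 1)
        )
        if violation <= delta1:
            return k
    return None
```

For example, with α = δ1 = .05 the bound is infeasible for small m2 (even the largest order statistic violates it), which matches the known minimum sample-size requirement of the umbrella algorithm.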
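The repeated equal-sized splitting described in the Dataset Splits row (B random divisions of one class's observations into a train-scoring half and a left-out half) can be sketched as follows. This is an illustrative Python version, not the paper's R code; the function name `random_splits` and the seed handling are ours.

```python
import numpy as np

def random_splits(obs, B, seed=0):
    """B random equal-sized splits of one class's observations `obs`
    (a 1-D array) into a train-scoring half ('ts') and a left-out
    half ('lo'), mirroring the s-CC / s-NPC construction."""
    rng = np.random.default_rng(seed)
    n = len(obs)
    half = n // 2  # equal-sized splits, e.g. m1 = m/2
    splits = []
    for _ in range(B):
        perm = rng.permutation(n)  # fresh random division each repetition
        splits.append((obs[perm[:half]], obs[perm[half:]]))
    return splits
```

Each of the B pairs partitions the class's observations, so statistics computed on the train-scoring half can be evaluated on the disjoint left-out half and then averaged over the B repetitions for stability.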