Noise Accumulation in High Dimensional Classification and Total Signal Index
Authors: Miriam R. Elman, Jessica Minnier, Xiaohui Chang, Dongseok Choi
JMLR 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We simulate four scenarios with differing amounts of signal strength to evaluate each method. After determining that noise accumulation may affect the performance of these classifiers, we assess factors that impact it. We conduct simulations by varying sample size, signal strength, signal strength proportional to the number of predictors, and signal magnitude with random forest classifiers. |
| Researcher Affiliation | Academia | Miriam R. Elman EMAIL School of Public Health Oregon Health & Science University-Portland State University 3181 SW Sam Jackson Park Rd Portland, OR 97239, USA; Jessica Minnier EMAIL School of Public Health Oregon Health & Science University-Portland State University 3181 SW Sam Jackson Park Rd Portland, OR 97239, USA; Xiaohui Chang EMAIL College of Business Oregon State University 2751 SW Jefferson Way Corvallis, OR 97331, USA; Dongseok Choi EMAIL School of Public Health Oregon Health & Science University-Portland State University 3181 SW Sam Jackson Park Rd Portland, OR 97239, USA |
| Pseudocode | No | The paper describes methods and simulations in narrative text and refers to existing algorithms (Random Forest, SVM, Boosted Classification Trees) without presenting structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Additional information is provided in the Appendix, and code is available on GitHub (Elman, 2018). |
| Open Datasets | No | Like Fan et al. (2014), we simulated data for two classes from standard multivariate normal distributions with an identity covariance matrix and p predictors, where µ1 = 0, µ2 was defined to be sparse with m nonzero elements and the remaining entries equal to zero, and n = 100 for each class. |
| Dataset Splits | No | For each method and scenario, a classification rule was developed for q = 2,...,5000 predictors on the training data set. This classifier was then applied to a corresponding test data set and used to predict whether new observations should be categorized into the first or second class. This process was repeated 100 times on training data sets, and these classifiers were then used to predict class membership for 100 test data sets. |
| Hardware Specification | No | All simulations were batch processed in R version 3.4.0 on a computer cluster (R Core Team, 2017). The nodes employed for analyses were running CentOS Linux 7. |
| Software Dependencies | Yes | All simulations were batch processed in R version 3.4.0 on a computer cluster (R Core Team, 2017). The nodes employed for analyses were running CentOS Linux 7. PCA was conducted using the prcomp function in base R, while the randomForest (4.6-12), e1071 (1.6-8), and gbm (2.1.3) packages were used to run the RF, SVM, and BCT procedures (Liaw and Wiener, 2002; Meyer et al., 2015; Ridgeway, 2017). |
| Experiment Setup | Yes | We mostly used the default settings from each package for the simulations (thus neglecting the importance of tuning for these methods). |
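The simulation design quoted in the Open Datasets and Dataset Splits rows (two classes drawn from standard multivariate normals with identity covariance, µ1 = 0, µ2 sparse with m nonzero entries, n = 100 per class, classifiers refit for growing numbers of predictors q) can be sketched as follows. This is a minimal illustration, not the authors' R code: it uses a nearest-centroid classifier as a simple stand-in for the paper's RF/SVM/BCT procedures, and the values of m, the signal magnitude, and the q grid are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(n=100, p=5000, m=10, mu=1.0):
    """Draw n observations per class from N(mu_k, I_p).

    Class 1 has mean zero; class 2 has a sparse mean whose first m
    entries equal mu (illustrative values, not the paper's scenarios).
    """
    mu2 = np.zeros(p)
    mu2[:m] = mu
    X = np.vstack([rng.standard_normal((n, p)),
                   rng.standard_normal((n, p)) + mu2])
    y = np.repeat([0, 1], n)
    return X, y

def centroid_error(X_tr, y_tr, X_te, y_te, q):
    """Test error of a nearest-centroid rule built on the first q predictors."""
    c0 = X_tr[y_tr == 0, :q].mean(axis=0)
    c1 = X_tr[y_tr == 1, :q].mean(axis=0)
    d0 = ((X_te[:, :q] - c0) ** 2).sum(axis=1)
    d1 = ((X_te[:, :q] - c1) ** 2).sum(axis=1)
    pred = (d1 < d0).astype(int)
    return float((pred != y_te).mean())

X_tr, y_tr = simulate()  # training set
X_te, y_te = simulate()  # independent test set

# As q grows past the m informative predictors, added coordinates carry
# only noise, and the test error climbs back up: noise accumulation.
for q in (2, 10, 100, 1000, 5000):
    print(q, round(centroid_error(X_tr, y_tr, X_te, y_te, q), 3))
```

Running the sketch shows the pattern the paper studies: error drops while q still covers informative predictors, then rises as purely noisy coordinates are added.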