Noise Accumulation in High Dimensional Classification and Total Signal Index
Authors: Miriam R. Elman, Jessica Minnier, Xiaohui Chang, Dongseok Choi
JMLR 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We simulate four scenarios with differing amounts of signal strength to evaluate each method. After determining that noise accumulation may affect the performance of these classifiers, we assess factors that impact it. We conduct simulations by varying sample size, signal strength, signal strength proportional to the number of predictors, and signal magnitude with random forest classifiers. |
| Researcher Affiliation | Academia | Miriam R. Elman EMAIL School of Public Health Oregon Health & Science University-Portland State University 3181 SW Sam Jackson Park Rd Portland, OR 97239, USA; Jessica Minnier EMAIL School of Public Health Oregon Health & Science University-Portland State University 3181 SW Sam Jackson Park Rd Portland, OR 97239, USA; Xiaohui Chang EMAIL College of Business Oregon State University 2751 SW Jefferson Way Corvallis, OR 97331, USA; Dongseok Choi EMAIL School of Public Health Oregon Health & Science University-Portland State University 3181 SW Sam Jackson Park Rd Portland, OR 97239, USA |
| Pseudocode | No | The paper describes methods and simulations in narrative text and refers to existing algorithms (Random Forest, SVM, Boosted Classification Trees) without presenting structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Additional information is provided in the Appendix, and code is available on GitHub (Elman, 2018). |
| Open Datasets | No | Like Fan et al. (2014), we simulated data for two classes from standard multivariate normal distributions with an identity covariance matrix and p predictors, where µ1 = 0, µ2 was defined to be sparse with m nonzero elements and the remaining entries equal to zero, and n = 100 for each class. |
| Dataset Splits | No | For each method and scenario, a classification rule was developed for q = 2,...,5000 predictors on the training data set. This classifier was then applied to a corresponding test data set and used to predict whether new observations should be categorized into the first or second class. This process was repeated 100 times on training data sets, and these classifiers were then used to predict class membership for 100 test data sets. |
| Hardware Specification | No | All simulations were batch processed in R version 3.4.0 on a computer cluster (R Core Team, 2017). The nodes employed for analyses were running CentOS Linux 7. |
| Software Dependencies | Yes | All simulations were batch processed in R version 3.4.0 on a computer cluster (R Core Team, 2017). The nodes employed for analyses were running CentOS Linux 7. PCA was conducted using the prcomp function in base R, while the randomForest (4.6-12), e1071 (1.6-8), and gbm (2.1.3) packages were used to run the RF, SVM, and BCT procedures (Liaw and Wiener, 2002; Meyer et al., 2015; Ridgeway, 2017). |
| Experiment Setup | Yes | We mostly used the default settings from each package for the simulations (thus neglecting the importance of tuning for these methods). |
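The simulation design quoted in the Open Datasets and Dataset Splits rows (two classes drawn from standard multivariate normals with identity covariance, µ1 = 0, µ2 sparse with m nonzero entries, n = 100 per class, classifiers refit for growing numbers of predictors q) can be sketched as follows. This is a minimal illustration, not the authors' R code: it uses a nearest-centroid classifier as a simple stand-in for the paper's RF/SVM/BCT procedures, and the values of m, the signal magnitude, and the q grid are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(n=100, p=5000, m=10, mu=1.0):
    """Draw n observations per class from N(mu_k, I_p).

    Class 1 has mean zero; class 2 has a sparse mean whose first m
    entries equal mu (illustrative values, not the paper's scenarios).
    """
    mu2 = np.zeros(p)
    mu2[:m] = mu
    X = np.vstack([rng.standard_normal((n, p)),
                   rng.standard_normal((n, p)) + mu2])
    y = np.repeat([0, 1], n)
    return X, y

def centroid_error(X_tr, y_tr, X_te, y_te, q):
    """Test error of a nearest-centroid rule built on the first q predictors."""
    c0 = X_tr[y_tr == 0, :q].mean(axis=0)
    c1 = X_tr[y_tr == 1, :q].mean(axis=0)
    d0 = ((X_te[:, :q] - c0) ** 2).sum(axis=1)
    d1 = ((X_te[:, :q] - c1) ** 2).sum(axis=1)
    pred = (d1 < d0).astype(int)
    return float((pred != y_te).mean())

X_tr, y_tr = simulate()  # training set
X_te, y_te = simulate()  # independent test set

# As q grows past the m informative predictors, added coordinates carry
# only noise, and the test error climbs back up: noise accumulation.
for q in (2, 10, 100, 1000, 5000):
    print(q, round(centroid_error(X_tr, y_tr, X_te, y_te, q), 3))
```

Running the sketch shows the pattern the paper studies: error drops while q still covers informative predictors, then rises as purely noisy coordinates are added.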