Sparse Semiparametric Discriminant Analysis for High-dimensional Zero-inflated Data
Authors: Hee Cheol Chung, Yang Ni, Irina Gaynanova
JMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We validate our approach using human gut microbiome, breast cancer microRNA, and single-cell RNA sequencing data, highlighting its superior classification accuracy and robustness to data transformations. From the application perspective, the numerical results on simulated and real data consistently convey that the proposed SEmiparametric Discriminant Analysis (SEDA) method is: 1) always the best-performing method on highly-skewed and zero-inflated data, with a significant margin of error improvement compared to existing approaches. |
| Researcher Affiliation | Academia | Hee Cheol Chung EMAIL Department of Mathematics and Statistics University of North Carolina at Charlotte Charlotte, NC 28223, USA Yang Ni EMAIL Department of Statistics and Data Sciences The University of Texas at Austin Austin, TX 78705, USA Irina Gaynanova EMAIL Department of Biostatistics University of Michigan Ann Arbor, MI 48109, USA |
| Pseudocode | No | The paper describes the methodology and mathematical derivations in detail (Sections 2 and 3) and mentions algorithms like 'coordinate descent algorithm' for solving optimization problems (Section 2.4). However, it does not include any clearly labeled pseudocode or algorithm blocks with structured, step-by-step procedures. |
| Open Source Code | Yes | The R code implementing SEDA is available at https://github.com/heech31/SEDA. |
| Open Datasets | Yes | We assess the classification performance of SEDA and competing methods on three sequencing data sets: the Quantitative Microbiome Profiling (QMP) data of Vandeputte et al. (2017), microRNA data from breast cancer patients available through The Cancer Genome Atlas Project (Cancer Genome Atlas Network, 2012), and single-cell RNA (scRNA) sequencing data from the 10x Genomics website (https://www.10xgenomics.com). |
| Dataset Splits | Yes | For each model and correlation structure, we consider equally (50:50) and unequally (20:80) proportioned class sizes and fix the sample sizes of training and test data at n = 150 and n_test = 300, respectively. For each data set, we apply the same methods as in Section 4, using 100 random splits into training (4/5) and testing (1/5). |
| Hardware Specification | No | This research was conducted with the advanced computing resources provided by Texas A&M and UNC Charlotte. This statement is too general and does not provide specific hardware details like GPU/CPU models, processor types, or memory specifications. |
| Software Dependencies | No | We consider high-dimensional COpula Discriminant Analysis (CODA) of Han et al. (2013), Negative Binomial Linear Discriminant Analysis (NBLDA) of Dong et al. (2016), Classification and Clustering of Sequencing Data Based on a Poisson Model (PoiClaClu) of Witten (2011), Random Forest (RF) of Breiman (2001), Sparse Logistic regression (S-Logistic) of Friedman et al. (2010), Sparse Semiparametric Discriminant Analysis (SSDA) of Mai and Zou (2015), and Sparse Support Vector Machine (S-SVM) of Yi and Huang (2017) using R packages NBLDA (Goksuluk et al., 2022), PoiClaClu (Witten, 2019), randomForest (Liaw and Wiener, 2002), glmnet (Friedman et al., 2010), and sparseSVM (Yi and Zeng, 2018), respectively. The paper lists several R packages used but does not specify their version numbers. |
| Experiment Setup | Yes | Given a fixed tuning parameter λ, we solve (8) with a solver implemented in C in the R package MGSDA (Gaynanova, 2021). For SEDA, both the sparsity tuning parameter λ and the intercept y are selected based on 5-fold cross-validation to minimize the misclassification error rate using a grid of 100 values for each. |
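
The evaluation protocol quoted in the Dataset Splits and Experiment Setup rows (100 random 4/5 vs. 1/5 splits, with the tuning parameter chosen by 5-fold cross-validation over a grid) can be sketched roughly as follows. This is a minimal illustration in Python, not the authors' R implementation: `fit_toy` is a hypothetical stand-in classifier (a soft-thresholded mean difference), not SEDA, the default grid values are assumptions, and only λ is tuned here, whereas the paper also tunes an intercept.

```python
import random

def random_split(n, test_frac=0.2, rng=None):
    """Return (train_idx, test_idx) for one random split of n samples."""
    rng = rng or random.Random()
    idx = list(range(n))
    rng.shuffle(idx)
    n_test = round(n * test_frac)
    return sorted(idx[n_test:]), sorted(idx[:n_test])

def kfold(indices, k=5, rng=None):
    """Yield (fit_idx, val_idx) pairs that partition `indices` into k folds."""
    rng = rng or random.Random()
    shuffled = list(indices)
    rng.shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        fit_idx = [j for m, f in enumerate(folds) if m != i for j in f]
        yield fit_idx, folds[i]

def fit_toy(X, y, idx, lam):
    """Hypothetical stand-in (NOT SEDA): soft-thresholded class-mean difference."""
    p = len(X[0])
    g0 = [i for i in idx if y[i] == 0]
    g1 = [i for i in idx if y[i] == 1]
    m0 = [sum(X[i][j] for i in g0) / (len(g0) or 1) for j in range(p)]
    m1 = [sum(X[i][j] for i in g1) / (len(g1) or 1) for j in range(p)]
    # Soft-thresholding sets small mean differences to zero (sparsity).
    w = [max(abs(b - a) - lam, 0.0) * (1 if b > a else -1) for a, b in zip(m0, m1)]
    mid = [(a + b) / 2 for a, b in zip(m0, m1)]
    return w, mid

def predict_toy(model, x):
    w, mid = model
    score = sum(wj * (xj - mj) for wj, xj, mj in zip(w, x, mid))
    return 1 if score > 0 else 0

def error_rate(model, X, y, idx):
    """Misclassification rate of `model` on the samples in `idx`."""
    return sum(predict_toy(model, X[i]) != y[i] for i in idx) / len(idx)

def select_lambda(X, y, train_idx, grid, k=5, rng=None):
    """Pick the grid value with the lowest mean k-fold CV error."""
    folds = list(kfold(train_idx, k, rng))
    cv_err = []
    for lam in grid:
        errs = [error_rate(fit_toy(X, y, f, lam), X, y, v) for f, v in folds]
        cv_err.append(sum(errs) / k)
    best = min(range(len(grid)), key=cv_err.__getitem__)
    return grid[best]

def run(X, y, n_splits=100, grid=None, seed=0):
    """Repeat: split 4/5 vs 1/5, tune lambda on the training part, test once."""
    rng = random.Random(seed)
    grid = grid or [i * 0.02 for i in range(100)]  # assumed grid of 100 values
    test_errors = []
    for _ in range(n_splits):
        tr, te = random_split(len(y), 0.2, rng)
        lam = select_lambda(X, y, tr, grid, 5, rng)
        test_errors.append(error_rate(fit_toy(X, y, tr, lam), X, y, te))
    return test_errors
```

Calling `run(X, y)` with `X` as a list of feature rows and `y` as 0/1 labels returns one test error per split. Tuning happens only inside the training portion of each split, so the held-out fifth is never used to choose λ.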