reproducibilityindex.ai

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Sparse Semiparametric Discriminant Analysis for High-dimensional Zero-inflated Data

Authors: Hee Cheol Chung, Yang Ni, Irina Gaynanova

JMLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We validate our approach using human gut microbiome, breast cancer microRNA, and single-cell RNA sequencing data, highlighting its superior classification accuracy and robustness to data transformations. From the application perspective, the numerical results on simulated and real data consistently convey that the proposed SEmiparametric Discriminant Analysis (SEDA) method is: 1) always the best-performing method on highly-skewed and zero-inflated data, with a significant margin of error improvement compared to existing approaches.
Researcher Affiliation	Academia	Hee Cheol Chung EMAIL Department of Mathematics and Statistics University of North Carolina at Charlotte Charlotte, NC 28223, USA Yang Ni EMAIL Department of Statistics and Data Sciences The University of Texas at Austin Austin, TX 78705, USA Irina Gaynanova EMAIL Department of Biostatistics University of Michigan Ann Arbor, MI 48109, USA
Pseudocode	No	The paper describes the methodology and mathematical derivations in detail (Sections 2 and 3) and mentions algorithms like 'coordinate descent algorithm' for solving optimization problems (Section 2.4). However, it does not include any clearly labeled pseudocode or algorithm blocks with structured, step-by-step procedures.
Open Source Code	Yes	The R code implementing SEDA is available at https://github.com/heech31/SEDA.
Open Datasets	Yes	We assess the classification performance of SEDA and competing methods on three sequencing data sets: the Quantitative Microbiome Profiling (QMP) data of Vandeputte et al. (2017), microRNA data from breast cancer patients available through The Cancer Genome Atlas Project (Cancer Genome Atlas Network, 2012), and single-cell RNA (scRNA) sequencing data from the 10x Genomics website (https://www.10xgenomics.com).
Dataset Splits	Yes	For each model and correlation structure, we consider equally (50:50) and unequally (20:80) proportioned class sizes and fix the sample sizes of training and test data at n = 150 and n_test = 300, respectively. For each data set, we apply the same methods as in Section 4, using 100 random splits into training (4/5) and testing (1/5).
Hardware Specification	No	This research was conducted with the advanced computing resources provided by Texas A&M and UNC Charlotte. This statement is too general and does not provide specific hardware details like GPU/CPU models, processor types, or memory specifications.
Software Dependencies	No	We consider high-dimensional COpula Discriminant Analysis (CODA) of Han et al. (2013), Negative Binomial Linear Discriminant Analysis (NBLDA) of Dong et al. (2016), Classification and Clustering of Sequencing Data Based on a Poisson Model (PoiClaClu) of Witten (2011), Random Forest (RF) of Breiman (2001), Sparse Logistic regression (S-Logistic) of Friedman et al. (2010), Sparse Semiparametric Discriminant Analysis (SSDA) of Mai and Zou (2015), and Sparse Support Vector Machine (S-SVM) of Yi and Huang (2017) using R packages NBLDA (Goksuluk et al., 2022), PoiClaClu (Witten, 2019), randomForest (Liaw and Wiener, 2002), glmnet (Friedman et al., 2010), and sparseSVM (Yi and Zeng, 2018), respectively. The paper lists several R packages used but does not specify their version numbers.
Experiment Setup	Yes	Given a fixed tuning parameter λ, we solve (8) with a solver implemented in C in the R package MGSDA (Gaynanova, 2021). For SEDA, both the sparsity tuning parameter λ and the intercept y are selected based on 5-fold cross-validation to minimize the misclassification error rate using a grid of 100 values for each.
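
The evaluation protocol quoted in the Dataset Splits and Experiment Setup rows (random 4/5 : 1/5 train/test splits, then 5-fold cross-validation over a grid on the training portion) can be sketched roughly as follows. This is a minimal Python illustration under stated assumptions, not the authors' R implementation: the function names `random_split`, `kfold_indices`, and `select_by_cv`, and the `cv_error` callback, are hypothetical, and only a single tuning parameter is searched here, whereas the paper tunes both λ and the intercept.

```python
import random


def random_split(n, test_fraction=0.2, rng=None):
    """Return (train_idx, test_idx) for one random split of n samples."""
    rng = rng or random.Random()
    idx = list(range(n))
    rng.shuffle(idx)
    n_test = round(n * test_fraction)
    # The first n_test shuffled indices form the test set, the rest the training set.
    return sorted(idx[n_test:]), sorted(idx[:n_test])


def kfold_indices(indices, k=5, rng=None):
    """Partition `indices` into k disjoint folds of near-equal size."""
    rng = rng or random.Random()
    shuffled = list(indices)
    rng.shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


def select_by_cv(train_idx, grid, cv_error, k=5, rng=None):
    """Pick the grid value with the smallest mean held-out error over k folds.

    `cv_error(fit_idx, val_idx, value)` is a hypothetical user-supplied callback
    that fits a classifier on fit_idx with the given tuning value and returns
    its misclassification rate on val_idx.
    """
    folds = kfold_indices(train_idx, k, rng)
    best_value, best_err = None, float("inf")
    for value in grid:
        errs = []
        for j, val_idx in enumerate(folds):
            # Fit on every fold except the j-th, validate on the j-th.
            fit_idx = [i for m, fold in enumerate(folds) if m != j for i in fold]
            errs.append(cv_error(fit_idx, val_idx, value))
        mean_err = sum(errs) / k
        if mean_err < best_err:
            best_value, best_err = value, mean_err
    return best_value, best_err
```

Repeating `random_split` 100 times with different seeds and calling `select_by_cv` on each training set would mirror the structure of the real-data experiments; the classifier inside `cv_error` is left to the reader.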