Generalization error bounds for multiclass sparse linear classifiers

Authors: Tomer Levy, Felix Abramovich

JMLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To illustrate the performance of the derived sparse multinomial logistic regression classifiers, we applied them to the Cancer sites data set considered in Vincent and Hansen (2014). It consists of bead-based expression data for n = 162 samples with d = 372 features (microRNAs) from L = 18 classes of normal and cancer tissue samples; the number of samples per class ranges from 5 to 26. Vincent and Hansen (2014) used the sparse group Lasso classifier for this data. We compared the performance of sparse group Slope, with λ_j's and κ_l's of the form given in (14), against sparse group Lasso (replicating Vincent and Hansen, 2014), random forest, and the well-known XGBoost gradient boosting trees classifier on the above data set, where we developed a proximal gradient algorithm for solving the sparse group Slope problem in (12) (see Appendix D). To remove various technical variations, following Vincent and Hansen (2014), the data was first normalized by centering and scaling the rows of the design matrix, and then standardized by centering and scaling the columns. We split the data into training (75%) and test (25%) sets. The tuning parameters of all classification procedures were chosen by 10-fold cross-validation on the training set, and the misclassification errors of the resulting classifiers were measured on the test set. We repeated the process 10 times, randomly partitioning the data into training and test sets. Table 1 presents the average (over 10 random splits) misclassification errors on the test sets, the numbers of selected features (non-zero rows of the regression coefficient matrix B), and the overall numbers of non-zero coefficients in B. It shows that both sparse multinomial logistic regression classifiers outperform their nonparametric counterparts on this data. Sparse group Slope yielded smaller misclassification errors than sparse group Lasso and, in addition, resulted in much sparser models.
Researcher Affiliation | Academia | Tomer Levy (EMAIL), Department of Statistics and Operations Research, Tel Aviv University; Felix Abramovich (EMAIL), Department of Statistics and Operations Research, Tel Aviv University
Pseudocode | Yes | Appendix D. Sparse group Slope algorithm. The penalized MLE minimization problem in (12) involves a sum of a convex smooth log-likelihood and a convex but non-smooth penalty consisting of two terms. A common approach to solving such optimization problems is the proximal gradient method (e.g., Beck, 2017). The proximal operator of a given convex function f is defined as prox_f(a) = arg min_b { (1/2) ||a − b||^2 + f(b) }. For the setup at hand, consider the proximal operator prox_{κ,λ}(A) = arg min_B { (1/2) ||A − B||_F^2 + ||B||_{κ,λ} }, where, recall, ||B||_{κ,λ} = Σ_{j=1}^d λ_j ||B||_(j) + Σ_{j=1}^d Σ_{l=1}^L κ_l |B|_{j(l)} = ||B||_λ + Σ_{j=1}^d ||B_j||_κ. Efficient proximal gradient algorithms exist for computing the operators prox_κ and prox_λ separately (see, respectively, Bogdan et al., 2015; Brzyski et al., 2019). We now show that applying prox_κ and prox_λ consecutively yields prox_{κ,λ}, as depicted in Algorithm 1. Algorithm 1: prox_{κ,λ}(A): for j ← 1 … d do U_j ← prox_κ(A_j); then B ← prox_λ(U).
Open Source Code | No | The paper does not contain any explicit statements about code release or links to a code repository.
Open Datasets | Yes | To illustrate the performance of the derived sparse multinomial logistic regression classifiers we applied them to the data set Cancer sites considered in Vincent and Hansen (2014).
Dataset Splits | Yes | We split the data into training (75%) and test (25%) sets.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU, GPU models, or memory specifications) used for running the experiments.
Software Dependencies | No | The paper mentions using 'random forest' and 'XGBoost classifiers', and solving with a 'proximal gradient algorithm', but does not specify any version numbers for these software components or libraries.
Experiment Setup | Yes | The tuning parameters of all classification procedures were chosen by 10-fold cross-validation on the training set, and the misclassification errors of the resulting classifiers were measured on the test set. We repeated the process 10 times, randomly partitioning the data into train and test sets.
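The Algorithm 1 quoted above composes the two Slope proximal operators: prox_κ applied row-wise, then prox_λ applied to the row norms. A minimal numerical sketch is below, assuming the standard stack-based prox for the sorted-L1 norm (Bogdan et al., 2015) and the norm-rescaling form of the group-Slope prox (Brzyski et al., 2019); `prox_sorted_l1` and `prox_group_slope` are illustrative names, not the authors' implementation.

```python
import numpy as np

def prox_sorted_l1(y, lam):
    """Prox of the sorted-L1 (Slope) norm with non-increasing weights lam >= 0.

    Stack-based block-averaging sketch in the spirit of Bogdan et al. (2015).
    """
    sign, a = np.sign(y), np.abs(y)
    order = np.argsort(-a)                  # sort |y| in decreasing order
    d = a[order] - lam                      # per-coordinate differences
    blocks = []                             # each block: [start, end, sum]
    for i, di in enumerate(d):
        blocks.append([i, i, di])
        # Merge adjacent blocks until block averages are strictly decreasing.
        while len(blocks) > 1 and (
            blocks[-2][2] / (blocks[-2][1] - blocks[-2][0] + 1)
            <= blocks[-1][2] / (blocks[-1][1] - blocks[-1][0] + 1)
        ):
            s, e, t = blocks.pop()
            blocks[-1][1] = e
            blocks[-1][2] += t
    x_sorted = np.empty_like(d)
    for s, e, t in blocks:
        x_sorted[s:e + 1] = max(t / (e - s + 1), 0.0)  # clip at zero
    x = np.empty_like(x_sorted)
    x[order] = x_sorted                     # undo the sort
    return sign * x

def prox_group_slope(A, kappa, lam):
    """Algorithm 1 sketch: prox_kappa row-wise, then prox_lambda on row norms."""
    U = np.vstack([prox_sorted_l1(A[j], kappa) for j in range(A.shape[0])])
    norms = np.linalg.norm(U, axis=1)
    new_norms = prox_sorted_l1(norms, lam)  # shrink the vector of row norms
    scale = np.divide(new_norms, norms, out=np.zeros_like(norms), where=norms > 0)
    return U * scale[:, None]
```

With all weights zero both operators reduce to the identity, and with constant weights `prox_sorted_l1` reduces to ordinary soft thresholding, which gives quick sanity checks on the sketch.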
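The evaluation protocol reported in the table (10 random 75%/25% splits, tuning by 10-fold cross-validation on the training set, error measured on the held-out test set) can be sketched as follows. Since no code is released, this is a generic illustration on synthetic data of the same shape as the Cancer sites set, with a simple nearest-centroid stand-in classifier; the data, the model, and the tuning grid over `k` are all placeholders, not the paper's method.

```python
import numpy as np

def fit_predict(X_tr, y_tr, X_te, k):
    """Stand-in classifier: nearest centroid on the k highest-variance features."""
    idx = np.argsort(X_tr.var(axis=0))[-k:]
    classes = np.unique(y_tr)
    cents = np.stack([X_tr[y_tr == c][:, idx].mean(axis=0) for c in classes])
    dist = ((X_te[:, None, idx] - cents[None, :, :]) ** 2).sum(axis=2)
    return classes[dist.argmin(axis=1)]

def cv_error(X, y, k, folds=10, seed=0):
    """Misclassification error estimated by k-fold cross-validation."""
    fold_ids = np.array_split(np.random.default_rng(seed).permutation(len(y)), folds)
    errs = []
    for test_idx in fold_ids:
        mask = np.ones(len(y), dtype=bool)
        mask[test_idx] = False
        pred = fit_predict(X[mask], y[mask], X[~mask], k)
        errs.append((pred != y[~mask]).mean())
    return float(np.mean(errs))

# Synthetic stand-in for the Cancer sites data: 162 samples, 372 features, 18 classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(162, 372))
y = rng.integers(0, 18, size=162)

test_errors = []
for split in range(10):                        # 10 random train/test partitions
    perm = np.random.default_rng(split).permutation(len(y))
    tr, te = perm[:121], perm[121:]            # ~75% train / 25% test
    best_k = min([10, 50, 200], key=lambda k: cv_error(X[tr], y[tr], k))
    pred = fit_predict(X[tr], y[tr], X[te], best_k)
    test_errors.append((pred != y[te]).mean())

print(f"average test misclassification error: {np.mean(test_errors):.3f}")
```

The averaging over 10 random partitions mirrors how Table 1's figures are reported; on this unstructured synthetic data the error is naturally near chance level.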