Learning Sparse Classifiers: Continuous and Mixed Integer Optimization Perspectives

Authors: Antoine Dedieu, Hussein Hazimeh, Rahul Mazumder

JMLR 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on real and synthetic data demonstrate that our approach leads to models with considerably improved statistical performance (especially variable selection) compared to competing methods.
Researcher Affiliation | Academia | Antoine Dedieu (Operations Research Center, Massachusetts Institute of Technology, Cambridge, MA 02139, USA); Hussein Hazimeh (Operations Research Center, Massachusetts Institute of Technology, Cambridge, MA 02139, USA); Rahul Mazumder (Sloan School of Management, Operations Research Center, and MIT Center for Statistics, Massachusetts Institute of Technology, Cambridge, MA 02142, USA)
Pseudocode | Yes | Algorithm 1: Cyclic Coordinate Descent (CD); Algorithm 2: CD with Local Combinatorial Search; Algorithm 3: Fast Heuristic for Local Search when m = 1; Algorithm 4: Integrality Generation Algorithm (IGA)
Open Source Code | Yes | Our CD and local combinatorial optimization algorithms are publicly available through our fast C++/R toolkit L0Learn: on CRAN at https://cran.r-project.org/package=L0Learn and also at https://github.com/hazimehh/L0Learn.
Open Datasets | Yes | We consider the following three binary classification data sets taken from the NIPS 2003 Feature Selection Challenge (Guyon et al., 2005): Arcene: This data set is used to identify cancerous vs. non-cancerous patterns in mass-spectrometric data. ... Dorothea: This data set is used to distinguish active chemical compounds in a drug. ... Dexter: The task here is to identify text documents discussing corporate acquisitions.
Dataset Splits | Yes | Arcene: ... We used 140 observations for training and 40 observations for testing. Dorothea: ... We used 805 observations for training and 230 observations for testing. Dexter: ... We used 420 observations for training and 120 observations for testing.
Hardware Specification | Yes | The experiments were carried out on a machine with a 6-core Intel Core i7-8750H processor and 16GB of RAM, running macOS 10.15.7, R 4.0.3 (with vecLib's BLAS implementation), and MATLAB R2020b. ... For this experiment, we use a machine with a 12-core Intel Xeon E5 @ 2.7 GHz and 64GB of RAM, running OS X 10.13.6 and Gurobi v8.1.
Software Dependencies | Yes | The experiments were carried out on a machine with a 6-core Intel Core i7-8750H processor and 16GB of RAM, running macOS 10.15.7, R 4.0.3 (with vecLib's BLAS implementation), and MATLAB R2020b. ... We use Gurobi's MIP solver for our experiments. ... running OS X 10.13.6 and Gurobi v8.1.
Experiment Setup | Yes | We use ℓ0-ℓq as a shorthand to denote the penalty λ0‖β‖_0 + λq‖β‖_q^q for q ∈ {1, 2}. For all penalties that involve 2 tuning parameters, i.e., ℓ0-ℓq (for q ∈ {1, 2}), MCP, GraSP, and NHTP, we sweep the parameters over a two-dimensional grid. For our penalties, we choose 100 λ0 values as described in Section 2.4. For GraSP and NHTP, we sweep the number of nonzeros between 1 and 100. For ℓ0-ℓ1, we choose a sequence of 10 λ1 values in [a, b], where a corresponds to a zero solution and b = 10^-4 a. Similarly, for ℓ0-ℓ2, GraSP, and NHTP, we choose 10 λ2 values between 10^-4 and 100 for the experiment in Section 5.2, and between 10^-8 and 10^-4 for that in Section 5.3. For MCP, the sequence of 100 λ values is set to the default values selected by ncvreg, and we vary the second parameter γ over 10 values between 1.5 and 25. For the ℓ1 penalty, the grid of 100 λ values is set to the default sequence chosen by glmnet. ... For all toolkits except NHTP, we set the convergence threshold to 10^-6 and solve for a regularization path with 100 solutions. For L0Learn, we use a grid of λ0 values varying between λ0^max (the value which sets all coefficients to zero) and 0.001 λ0^max. For L0Learn and NHTP, we set λ2 = 10^-7.
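To make the quoted setup concrete, below is a minimal Python sketch of the kind of cyclic coordinate descent named in Algorithm 1 and of a geometric λ0 grid running from λ0^max down to 0.001 λ0^max. This is an illustration under stated assumptions, not L0Learn's implementation: it uses squared loss (the paper targets classification losses such as hinge/logistic), assumes standardized features with ‖x_j‖²/n = 1, and the function names `cd_l0l2` and `lam0_path` are hypothetical.

```python
import numpy as np

def cd_l0l2(X, y, lam0, lam2, max_iter=100, tol=1e-6):
    """Cyclic CD (sketch) for
        (1/2n)||y - X b||^2 + lam0*||b||_0 + lam2*||b||_2^2,
    assuming each column satisfies ||x_j||^2 / n = 1.
    Squared loss stands in for the paper's classification losses."""
    n, p = X.shape
    beta = np.zeros(p)
    r = y.copy()  # residual y - X @ beta (beta starts at zero)
    # Coordinate j is nonzero iff rho_j^2 / (2*(1 + 2*lam2)) > lam0
    thresh = np.sqrt(2.0 * lam0 * (1.0 + 2.0 * lam2))
    for _ in range(max_iter):
        max_change = 0.0
        for j in range(p):
            # partial residual correlation with beta_j removed
            rho = X[:, j] @ (r + X[:, j] * beta[j]) / n
            # exact 1-D minimizer: ridge shrinkage + L0 hard threshold
            new = rho / (1.0 + 2.0 * lam2) if abs(rho) > thresh else 0.0
            if new != beta[j]:
                r += X[:, j] * (beta[j] - new)
                max_change = max(max_change, abs(new - beta[j]))
                beta[j] = new
        if max_change < tol:  # convergence threshold, as in the quote (1e-6)
            break
    return beta

def lam0_path(X, y, lam2, n_lams=100, ratio=1e-3):
    """Geometric grid from lam0_max (all coefficients zero) to ratio*lam0_max."""
    n = X.shape[0]
    rho_max = np.max(np.abs(X.T @ y)) / n
    lam0_max = rho_max ** 2 / (2.0 * (1.0 + 2.0 * lam2))
    return np.geomspace(lam0_max, ratio * lam0_max, n_lams)
```

On a standardized synthetic problem, sweeping `lam0_path(...)` from largest to smallest λ0 traces a regularization path of 100 solutions of increasing support size, mirroring the grid described for L0Learn (with a small ridge term such as λ2 = 10^-7).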