abess: A Fast Best-Subset Selection Library in Python and R

Authors: Jin Zhu, Xueqin Wang, Liyuan Hu, Junhao Huang, Kangkang Jiang, Yanhang Zhang, Shiyun Lin, Junxian Zhu

JMLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We compare abess with popular variable selection libraries in Python and R through regression, classification, and PCA. All computations are conducted on an Ubuntu platform with Intel(R) Core(TM) i9-9940X CPU @ 3.30GHz and 48 RAM. Table 2 displays the regression and classification analysis results, suggesting abess derives parsimonious models that achieve competitive performance in a few minutes. Particularly, for the cancer data set, it is more than 20x faster than scikit-learn (ℓ1). The results of the sparse PCA (SPCA) are demonstrated in Table 3.
Researcher Affiliation | Academia | Jin Zhu [1], Xueqin Wang [2], Liyuan Hu [1], Junhao Huang [1], Kangkang Jiang [1], Yanhang Zhang [3], Shiyun Lin [4], Junxian Zhu [5]. Affiliations: [1] Department of Statistical Science, Sun Yat-Sen University, Guangzhou, GD, China; [2] Department of Statistics and Finance/International Institute of Finance, School of Management, University of Science and Technology of China, Hefei, Anhui, China; [3] School of Statistics, Renmin University of China, Beijing, China; [4] Center for Statistical Science, Peking University, Beijing, China; [5] Saw Swee Hock School of Public Health, National University of Singapore, Singapore.
Pseudocode | No | The paper includes code snippets in Figure 2 and Figure 3, but these are complete executable code examples (R and Python) rather than structured pseudocode or algorithm blocks describing a method in a generic, language-agnostic way.
Open Source Code | Yes | The core of the library is programmed in C++. For ease of use, a Python library is designed for convenient integration with scikit-learn, and it can be installed from the Python Package Index (PyPI). In addition, a user-friendly R library is available at the Comprehensive R Archive Network (CRAN). The source code is available at: https://github.com/abess-team/abess.
Open Datasets | Yes | We are grateful to UCI Machine Learning Repository for sharing the superconductivity and musk data sets. Table 2: Average performance on the superconductivity data set (for regression), the cancer and the musk data sets (for classification) (Chin et al., 2006; Dua and Graff, 2017; Hamidieh, 2018) based on 20 randomly drawn test sets. The data set has 217 observations, each of which has 1,413 genetic factors (Christensen et al., 2009).
Dataset Splits | Yes | Table 2: Average performance on the superconductivity data set (for regression), the cancer and the musk data sets (for classification) (Chin et al., 2006; Dua and Graff, 2017; Hamidieh, 2018) based on 20 randomly drawn test sets. Figure 3 also shows an example using GridSearchCV with cv=5: `grid_search = GridSearchCV(pipe, param_grid, scoring=scorer, cv=5)`
Hardware Specification | Yes | All computations are conducted on an Ubuntu platform with Intel(R) Core(TM) i9-9940X CPU @ 3.30GHz and 48 RAM.
Software Dependencies | Yes | abess can run on most Linux distributions, Windows 32 or 64-bit, and macOS with Python (version ≥ 3.6) or R (version ≥ 3.1.0), and can be easily installed from PyPI and CRAN. Python version is 3.9.1 and R version is 3.6.3. Library versions: scikit-learn (ℓ1) 1.0.0, celer 0.6.1, elasticnet 1.3.0.
Experiment Setup | Yes | Figure 3 illustrates the integration of the abess Python interface with scikit-learn's modules to build a non-linear model for diagnosing malignant tumors. The code block shows specific parameters for `PolynomialFeatures` (`include_bias=False`, `degree: [1, 2, 3]`, `interaction_only: [True, False]`) and `GridSearchCV` (`cv=5`).
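
The grid search described under Experiment Setup can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's Figure 3: scikit-learn's own `LogisticRegression` is swapped in for the abess estimator so the sketch runs without abess installed, the data are synthetic, and the `StandardScaler` step, the accuracy-based `scorer`, and the pipeline step names (`poly`, `scale`, `clf`) are hypothetical choices. Only `include_bias=False`, the `degree` and `interaction_only` grids, and `cv=5` come from the summary above.

```python
# Sketch only: the estimator, data, scaler, and scorer are assumptions.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Small synthetic binary classification problem (assumption, keeps the run fast).
X, y = make_classification(
    n_samples=200, n_features=6, n_informative=3, random_state=0
)

# Polynomial expansion followed by a classifier. LogisticRegression here is
# scikit-learn's, standing in for the abess estimator.
pipe = Pipeline([
    ("poly", PolynomialFeatures(include_bias=False)),
    ("scale", StandardScaler()),  # added for numerical stability (assumption)
    ("clf", LogisticRegression(max_iter=2000)),
])

# Grid values reported in the summary: 3 degrees x 2 interaction settings.
param_grid = {
    "poly__degree": [1, 2, 3],
    "poly__interaction_only": [True, False],
}

scorer = make_scorer(accuracy_score)  # hypothetical choice of metric
grid_search = GridSearchCV(pipe, param_grid, scoring=scorer, cv=5)
grid_search.fit(X, y)

# Six candidate settings are evaluated, each with 5-fold cross-validation.
print(len(grid_search.cv_results_["params"]))
print(grid_search.best_params_)
```

Because the estimator is just the last step of the pipeline, any scikit-learn-compatible classifier can take its place there without changing the grid or the cross-validation settings.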