Towards Ultrahigh Dimensional Feature Selection for Big Data
Authors: Mingkui Tan, Ivor W. Tsang, Li Wang
JMLR 2014
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Comprehensive experiments on a wide range of synthetic and real-world data sets of tens of millions of data points with O(10^14) features demonstrate the competitive performance of the proposed method over state-of-the-art feature selection methods in terms of generalization performance and training efficiency. |
| Researcher Affiliation | Academia | Mingkui Tan, School of Computer Engineering, Nanyang Technological University, Blk N4, #B1a-02, Nanyang Avenue, 639798, Singapore; Ivor W. Tsang, Center for Quantum Computation & Intelligent Systems, University of Technology Sydney, P.O. Box 123, Broadway, NSW 2007, Sydney, Australia; Li Wang, Department of Mathematics, University of California San Diego, 9500 Gilman Drive, La Jolla, CA 92093, USA |
| Pseudocode | Yes | Algorithm 1 Cutting Plane Algorithm for Solving (13). Algorithm 2 Algorithm for Worst-Case Analysis. Algorithm 3 Moreau Projection Sτ(u, v). Algorithm 4 Accelerated Proximal Gradient for Solving Problem (22) (Inner Iterations). Algorithm 5 Incremental Implementation of Algorithm 2 for Ultrahigh Dimensional Data. |
| Open Source Code | Yes | 1. The C++ and MATLAB source codes of the proposed methods are publicly available at http://www.tanmingkui.com/fgm.html. |
| Open Datasets | Yes | 5. Among these data sets, epsilon, real-sim, rcv1.binary, news20.binary and kddb can be downloaded at http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/, aut-avn can be downloaded at http://vikas.sindhwani.org/svmlin.html and Arxiv astro-ph is from Joachims (2006). 13. The data set is available from http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/. |
| Dataset Splits | Yes | Among them, the epsilon, Arxiv astro-ph, rcv1.binary and kddb data sets have already been split into training and testing sets. For real-sim, aut-avn and news20.binary, we randomly split them into training and testing sets, as shown in Table 1. |
| Hardware Specification | Yes | All the comparisons are performed on a 2.27GHz Intel(R) Core(TM) 4 DUO CPU running Windows Server 2003 with 24.0GB main memory. |
| Software Dependencies | No | All the methods are implemented in C++. 7. Sources are available at http://www.csie.ntu.edu.tw/~cjlin/liblinear/. 9. Sources are available at http://www.public.asu.edu/~jye02/Software/SLEP/index.htm. The paper mentions specific software such as C++, LIBLINEAR, SLEP, and MATLAB, but does not provide version numbers for them, which are necessary for reproducibility. |
| Experiment Setup | Yes | For FGM, we study FGM with the SimpleMKL solver (denoted by MKL-FGM) (Tan et al., 2010), FGM with the APG method for the squared hinge loss (denoted by PROX-FGM) and for the logistic loss (denoted by PROX-SLR), respectively. ... In SGD-SLR, there are three important parameters, namely λ1 to penalize \|\|w\|\|_1, λ2 to penalize \|\|w\|\|_2^2, and the stopping criterion min.dgap. As suggested by the package, in the following experiments we fix λ2 = 1e-4 and min.dgap = 1e-5, and change λ1 to obtain different levels of sparsity. ... We set C = 10 and test different B ∈ {10, 30, 50}. ... we vary C ∈ [0.001, 0.007] for l1-SVM, C ∈ [5e-3, 4e-2] for l1-LR and λ1 ∈ [7.2e-4, 2.5e-3] for SGD-SLR. ... we test C ∈ {0.5, 5, 50, 500}. ... we test two values of ϵc, namely ϵc = 0.005 and ϵc = 0.001. ... we vary C to select different numbers of groups under the stopping tolerance ϵc = 0.001. For each C, we test B ∈ {2, 5, 8, 10}. The tradeoff parameter λ in SLEP is chosen from [0, 1]... Specifically, we set λ ∈ [0.002, 0.700] for FISTA and ACTIVE, and λ ∈ [0.003, 0.1] for BCD. |
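To make the quoted parameter sweeps concrete, the grids above can be enumerated with a small helper. This is a minimal illustrative sketch of how one might reproduce the sweep over the reported FGM settings (C = 10 with B ∈ {10, 30, 50}, and the separate sweep C ∈ {0.5, 5, 50, 500}); the helper function and grid names are our own, not part of the authors' released C++/MATLAB code.

```python
from itertools import product

# Grids quoted in the Experiment Setup row: FGM fixes C = 10 and
# sweeps the sparsity budget B; a second experiment sweeps C alone.
fgm_grid = {"C": [10], "B": [10, 30, 50]}
c_grid = {"C": [0.5, 5, 50, 500]}

def enumerate_grid(grid):
    """Yield one {parameter: value} dict per point in the Cartesian
    product of the grid (keys sorted for deterministic order)."""
    keys = sorted(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))

configs = list(enumerate_grid(fgm_grid))
# Three runs for FGM: B = 10, 30, 50, each with C = 10.
```

Each yielded dict would then be passed to a single training run, so the full sweep is just a loop over `enumerate_grid(...)`.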