Towards Ultrahigh Dimensional Feature Selection for Big Data
Authors: Mingkui Tan, Ivor W. Tsang, Li Wang
JMLR 2014
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Comprehensive experiments on a wide range of synthetic and real-world data sets of tens of millions of data points with O(10^14) features demonstrate the competitive performance of the proposed method over state-of-the-art feature selection methods in terms of generalization performance and training efficiency. |
| Researcher Affiliation | Academia | Mingkui Tan, School of Computer Engineering, Nanyang Technological University, Blk N4, #B1a-02, Nanyang Avenue, 639798, Singapore; Ivor W. Tsang, Center for Quantum Computation & Intelligent Systems, University of Technology Sydney, P.O. Box 123, Broadway, NSW 2007, Sydney, Australia; Li Wang, Department of Mathematics, University of California San Diego, 9500 Gilman Drive, La Jolla, CA 92093, USA |
| Pseudocode | Yes | Algorithm 1 Cutting Plane Algorithm for Solving (13). Algorithm 2 Algorithm for Worst-Case Analysis. Algorithm 3 Moreau Projection Sτ(u, v). Algorithm 4 Accelerated Proximal Gradient for Solving Problem (22) (Inner Iterations). Algorithm 5 Incremental Implementation of Algorithm 2 for Ultrahigh Dimensional Data. |
| Open Source Code | Yes | 1. The C++ and MATLAB source codes of the proposed methods are publicly available at http://www.tanmingkui.com/fgm.html. |
| Open Datasets | Yes | 5. Among these data sets, epsilon, real-sim, rcv1.binary, news20.binary and kddb can be downloaded at http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/, aut-avn can be downloaded at http://vikas.sindhwani.org/svmlin.html and Arxiv astro-ph is from Joachims (2006). 13. The data set is available from http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/. |
| Dataset Splits | Yes | Among them, the epsilon, Arxiv astro-ph, rcv1.binary and kddb data sets have already been split into training and testing sets. For real-sim, aut-avn and news20.binary, we randomly split them into training and testing sets, as shown in Table 1. |
| Hardware Specification | Yes | All the comparisons are performed on a 2.27GHz Intel(R) Core(TM) 4 DUO CPU running Windows Server 2003 with 24.0GB main memory. |
| Software Dependencies | No | All the methods are implemented in C++. 7. Sources are available at http://www.csie.ntu.edu.tw/~cjlin/liblinear/. 9. Sources are available at http://www.public.asu.edu/~jye02/Software/SLEP/index.htm. The paper mentions specific software such as C++, LIBLINEAR, SLEP, and MATLAB, but does not provide version numbers for them, which are necessary for reproducibility. |
| Experiment Setup | Yes | For FGM, we study FGM with the SimpleMKL solver (denoted by MKL-FGM) (Tan et al., 2010), FGM with the APG method for the squared hinge loss (denoted by PROX-FGM) and for the logistic loss (denoted by PROX-SLR), respectively. ... In SGD-SLR, there are three important parameters, namely λ1 to penalize \|\|w\|\|_1, λ2 to penalize \|\|w\|\|_2^2, and the stopping criterion min.dgap. As suggested by the package, in the following experiments we fix λ2 = 1e-4 and min.dgap = 1e-5, and change λ1 to obtain different levels of sparsity. ... We set C = 10 and test different B ∈ {10, 30, 50}. ... we vary C ∈ [0.001, 0.007] for l1-SVM, C ∈ [5e-3, 4e-2] for l1-LR and λ1 ∈ [7.2e-4, 2.5e-3] for SGD-SLR. ... we test C ∈ {0.5, 5, 50, 500}. ... we test two values of ϵc, namely ϵc = 0.005 and ϵc = 0.001. ... we vary C to select different numbers of groups under the stopping tolerance ϵc = 0.001. For each C, we test B ∈ {2, 5, 8, 10}. The tradeoff parameter λ in SLEP is chosen from [0, 1]... Specifically, we set λ ∈ [0.002, 0.700] for FISTA and ACTIVE, and λ ∈ [0.003, 0.1] for BCD. |
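To make the quoted parameter sweeps concrete, the grids above can be enumerated with a small helper. This is a minimal illustrative sketch of how one might reproduce the sweep over the reported FGM settings (C = 10 with B ∈ {10, 30, 50}, and the separate sweep C ∈ {0.5, 5, 50, 500}); the helper function and grid names are our own, not part of the authors' released C++/MATLAB code.

```python
from itertools import product

# Grids quoted in the Experiment Setup row: FGM fixes C = 10 and
# sweeps the sparsity budget B; a second experiment sweeps C alone.
fgm_grid = {"C": [10], "B": [10, 30, 50]}
c_grid = {"C": [0.5, 5, 50, 500]}

def enumerate_grid(grid):
    """Yield one {parameter: value} dict per point in the Cartesian
    product of the grid (keys sorted for deterministic order)."""
    keys = sorted(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))

configs = list(enumerate_grid(fgm_grid))
# Three runs for FGM: B = 10, 30, 50, each with C = 10.
```

Each yielded dict would then be passed to a single training run, so the full sweep is just a loop over `enumerate_grid(...)`.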