Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Fast SVM Training Using Approximate Extreme Points
Authors: Manu Nandan, Pramod P. Khargonekar, Sachin S. Talathi
JMLR 2014 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive computational experiments on nine data sets compared AESVM to LIBSVM (Chang and Lin, 2011), CVM (Tsang et al., 2005), BVM (Tsang et al., 2007), LASVM (Bordes et al., 2005), SVMperf (Joachims and Yu, 2009), and the random features method (Rahimi and Recht, 2007). Our AESVM implementation was found to train much faster than the other methods, while its classification accuracy was similar to that of LIBSVM in all cases. In particular, for a seizure detection data set, AESVM training was almost 500 times faster than LIBSVM and LASVM and 20 times faster than CVM and BVM. Additionally, AESVM also gave competitively fast classification times. |
| Researcher Affiliation | Collaboration | Manu Nandan (EMAIL), Department of Computer and Information Science and Engineering, University of Florida, Gainesville, FL 32611, USA; Pramod P. Khargonekar (EMAIL), Department of Electrical and Computer Engineering, University of Florida, Gainesville, FL 32611, USA; Sachin S. Talathi (EMAIL), Qualcomm Research Center, 5775 Morehouse Dr, San Diego, CA 92121, USA |
| Pseudocode | Yes | Our main contribution is the new AESVM formulation that can be used for fast SVM training. We develop and analyze our technique along the following lines: Algorithmic: The algorithm Derive RS, described in Section 4, computes the representative set in linear time. |
| Open Source Code | No | The authors will provide the software implementation of AESVM and Derive RS upon request. |
| Open Datasets | Yes | D1 KDD 99 intrusion detection data set: This data set is available for download at http://archive.ics.uci.edu/ml/datasets/KDD+Cup+1999+Data. D2 Localization data for person activity: This data set has been used in a study on agent-based care for independent living (Kaluža et al., 2010). It has N = 164860 data vectors of seven features. D3 Seizure detection data set: This data set has N = 982863 data vectors, three features (D = 3) and density = 100%. ... Details of the data set can be found in Nandan et al. (2010)... D4 Forest cover type data set: This data set has N = 581012 data vectors, fifty-four features (D = 54) and density = 22%. It is used to classify the forest cover of areas of 30m × 30m size into one of seven types. We followed the method used in Collobert et al. (2002), where a classification of forest cover type 2 from the other cover types was performed. D5 IJCNN1 data set: This data set was used in the IJCNN 2001 generalization ability challenge (Chang and Lin, 2001). The training set and testing set have 49990 (N = 49990) and 91701 data vectors respectively. It has 22 features (D = 22) and training set density = 59%. D6 Adult income data set: This data set, derived from the 1994 Census database, was used to classify incomes over $50000 from those below it. The training set has N = 32561 with D = 123 and density = 11%, while the testing set has 16281 data vectors. The data is pre-processed as described in Platt (1999). D7 Epsilon data set: This data set was used for the 2008 Pascal large scale learning challenge and in Yuan et al. (2011). It comprises 400000 data vectors that are 100% dense with D = 2000. D8 MNIST character recognition data set: This widely used data set (Lecun et al., 1998) of handwritten characters has a training set of N = 60000, D = 780 and density = 19%. We performed the binary classification task of classifying the character 0 from the others. The testing set has 10000 data vectors. D9 w8a data set: This artificial data set used in Platt (1999) was randomly generated and has D = 300 features. The training set has N = 49749 with density = 4% and the testing set has 14951 data vectors. |
| Dataset Splits | Yes | For data sets D2, D3 and D4, we performed five-fold cross-validation. We did not perform five-fold cross-validation on the other data sets, because they have been widely used in their native form with separate training and testing sets. |
| Hardware Specification | No | No specific hardware details (like CPU/GPU models, memory, or specific computing cluster information) were provided in the paper. The authors only mention 'computational resources' in the acknowledgments. |
| Software Dependencies | Yes | We compared AESVM to the widely used LIBSVM library, ver. 3.1. ... We focused our experiments on an SMO (Fan et al., 2005) based implementation of AESVM and Derive RS. |
| Experiment Setup | Yes | To reflect a typical SVM training scenario, we performed a grid search over all eighty-four combinations of the SVM hyper-parameters C = {2^-4, 2^-3, ..., 2^6, 2^7} and g = {2^-4, 2^-3, 2^-2, ..., 2^1, 2^2}. As mentioned earlier, for data sets D2, D3 and D4, five-fold cross-validation was performed. The parameters for Derive RS were P = 10^5 and V = 10^3, and the first level segregation was performed using FLS2. Three values of the tolerance parameter ϵ were investigated: ϵ = 10^-2, 10^-3 or 10^-4. |
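The grid search described in the Experiment Setup row (84 combinations of C and g with five-fold cross-validation) can be sketched with scikit-learn, whose `SVC` is itself a wrapper around LIBSVM. This is an illustrative reconstruction under stated assumptions, not the authors' code: the dataset here is synthetic, and the AESVM representative-set step is not reproduced.

```python
# Illustrative sketch of the paper's hyper-parameter search protocol
# (not the authors' implementation; uses scikit-learn's LIBSVM wrapper).
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# 84 combinations: 12 values of C (2^-4 .. 2^7) x 7 values of gamma (2^-4 .. 2^2),
# matching the ranges quoted from the paper.
PARAM_GRID = {
    "C": [2.0 ** k for k in range(-4, 8)],
    "gamma": [2.0 ** k for k in range(-4, 3)],
}

def run_grid_search(X, y):
    """Five-fold CV grid search over RBF-kernel SVM hyper-parameters."""
    search = GridSearchCV(SVC(kernel="rbf"), PARAM_GRID, cv=5)
    search.fit(X, y)
    return search.best_params_, search.best_score_
```

For data sets D5, D6, D8 and D9, which ship with fixed train/test splits, the paper evaluates on the held-out test set instead of cross-validating; in the sketch above that would mean fitting on the training portion only and scoring separately.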