Multi-scale Classification using Localized Spatial Depth
Authors: Subhajit Dutta, Soham Sarkar, Anil K. Ghosh
JMLR 2016
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The proposed classifier can be conveniently used even when the dimension of the data is larger than the sample size, and its good discriminatory power for such data has been established using theoretical as well as numerical results. In Sections 5 and 6, some simulated and benchmark data sets are analyzed to demonstrate the usefulness of these proposed classifiers. |
| Researcher Affiliation | Academia | Subhajit Dutta EMAIL Department of Mathematics and Statistics Indian Institute of Technology Kanpur 208016, India. Soham Sarkar EMAIL Anil K. Ghosh EMAIL Theoretical Statistics and Mathematics Unit Indian Statistical Institute 203, B. T. Road, Kolkata 700108, India. |
| Pseudocode | No | No structured pseudocode or algorithm blocks are explicitly present in the paper. The methodology is described using mathematical formulations and textual descriptions. |
| Open Source Code | Yes | For the classifiers based on SPD and LSPD, we wrote our own R codes and they are available at the link goo.gl/E5tmd6. |
| Open Datasets | Yes | The biomedical data set is taken from the CMU data archive (http://lib.stat.cmu.edu/datasets/). The diabetes data set is available in the R library mclust (also analyzed in Reaven and Miller, 1979). All other data are taken from the UCI machine learning repository (http://archive.ics.uci.edu/ml/), except for the lightning-2 data and the colon data (Alon et al., 1999): the lightning-2 data set is from the UCR time series classification archive (http://www.cs.ucr.edu/~eamonn/time_series_data/), while the colon data set is taken from the R library rda. |
| Dataset Splits | Yes | In each example, taking an equal number of observations from each of the two competing classes, we generated training and test sets of sizes 200 and 500, respectively. This procedure was repeated 500 times, and the average test set misclassification rates of different classifiers are reported in Tables 1 and 2 along with their corresponding standard errors. Throughout this article, we have used 50 different values of h for multi-scale classification based on LSPD, and the weight function is computed using a 5-fold cross-validation method. |
| Hardware Specification | Yes | All the calculations were done on a desktop computer with an Intel i7 (2.2 GHz) processor having 8 GB RAM. |
| Software Dependencies | Yes | We used the codes available at the R library e1071 (Dimitriadou et al., 2011). For the implementation of TREE and RF, we used the R codes available in the libraries tree (Ripley, 2011) and randomForest (Liaw and Wiener, 2002), respectively. For the maximum LD classifier, we used the R library DepthProc (Kosiorowski and Zawadzki, 2016), and the R library VGAM (Yee, 2008) was used to fit GAM. |
| Experiment Setup | Yes | For the multi-scale method based on KDE, we have considered 50 equi-spaced values of the bandwidth in the range suggested by Ghosh et al. (2006). For the multi-scale version of k-NN, we considered all possible values of k (see Ghosh et al., 2005, for more details). For the RBF kernel, it has been suggested in the literature to use γ = 1/d (see http://www.csie.ntu.edu.tw/~cjlin/libsvm/). However, for our numerical work, we considered γ = i/10d for 1 ≤ i ≤ 50. We also used 25 different values for the box constraint in the interval [0.1, 100], which were equi-spaced in the logarithmic scale. For classification tree, the deviance function was used as a measure of impurity, and the maximum height of the tree was restricted to 31. Nodes with less than 5 observations were never considered for splitting. We have combined the results of 500 trees in RF, where each tree was generated based on 63.2% randomly chosen observations from the training sample. At any stage, only a random subset of √d out of d variables were considered for splitting. |
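The SVM tuning grid quoted above (γ = i/10d for 1 ≤ i ≤ 50, plus 25 box-constraint values equi-spaced on a log scale over [0.1, 100]) can be reconstructed directly. This is a minimal sketch of that grid in Python, not the authors' R code; the dimension `d = 10` is a hypothetical example value.

```python
import math

d = 10  # hypothetical data dimension, for illustration only

# Kernel-width grid: gamma = i / (10 * d) for i = 1, ..., 50
gammas = [i / (10 * d) for i in range(1, 51)]

# 25 box-constraint (cost) values, equi-spaced on the log10 scale over [0.1, 100]
lo, hi, m = math.log10(0.1), math.log10(100), 25
box_constraints = [10 ** (lo + i * (hi - lo) / (m - 1)) for i in range(m)]

print(len(gammas), gammas[0], gammas[-1])
print(len(box_constraints), box_constraints[0], box_constraints[-1])
```

The full search then evaluates all 50 × 25 = 1250 (γ, C) pairs; spacing the cost values logarithmically covers both weakly and strongly regularized fits with few grid points.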
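The weight function in the Dataset Splits row is tuned by 5-fold cross-validation on the training sample of size 200. A generic sketch of such a fold assignment, using only the Python standard library (the paper's own R implementation is not reproduced here):

```python
import random

n_train = 200      # training-set size reported in the paper
n_folds = 5        # 5-fold cross-validation for the weight function

# Shuffle indices once, then deal them round-robin into 5 folds
rng = random.Random(0)  # fixed seed, for reproducibility of this sketch
indices = list(range(n_train))
rng.shuffle(indices)
folds = [indices[i::n_folds] for i in range(n_folds)]

# Each fold serves once as the validation set, the rest as training
for k, held_out in enumerate(folds):
    train_idx = [i for f in folds if f is not held_out for i in f]
    # ... fit the classifier on train_idx, score on held_out ...
```

With 200 observations and 5 folds, each validation fold holds exactly 40 observations and each fit uses the remaining 160.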