On Generalizations of Some Distance Based Classifiers for HDLSS Data
Authors: Sarbojit Roy, Soham Sarkar, Subhajit Dutta, Anil K. Ghosh
JMLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | High-dimensional behavior of the proposed classifiers is studied theoretically. Numerical experiments with a variety of simulated examples as well as an extensive analysis of benchmark data sets from three different databases exhibit advantages of the proposed methods. |
| Researcher Affiliation | Academia | Sarbojit Roy EMAIL Department of Mathematics and Statistics, IIT Kanpur, Kanpur 208016, India. Soham Sarkar EMAIL Institut de Mathématiques, École Polytechnique Fédérale de Lausanne, 1015 Lausanne, Switzerland. Subhajit Dutta EMAIL Department of Mathematics and Statistics, IIT Kanpur, Kanpur 208016, India. Anil K. Ghosh EMAIL Theoretical Statistics and Mathematics Unit, Indian Statistical Institute, Kolkata 700108, India. |
| Pseudocode | No | The paper describes methods mathematically and in text, but no explicit pseudocode blocks or algorithm listings are provided. |
| Open Source Code | Yes | Our classifiers were implemented in R too, and the codes are available from this link. |
| Open Datasets | Yes | Numerical experiments with a variety of simulated examples as well as an extensive analysis of benchmark data sets from three different databases exhibit advantages of the proposed methods. The CricketX and EOGHorizontalSignal data sets are both 12-class problems from the UCR Time Series Classification Archive (see Dau et al., 2018)... The GSE2685 data set (available at the Microarray database: http://www.biolab.si/supp/bi-cancer/projections/) ... In the nutt2003v2 data set (available at the CompCancer database: https://schlieplab.org/Static/Supplements/CompCancer/datasets.htm) |
| Dataset Splits | Yes | For each example, we generated 50 observations from each class to form the training sample. Misclassification rates of different classifiers are computed based on a test set consisting of 500 (250 from each class) observations. In Example 4, the training sample sizes were set to 50 and 25, respectively. For our analysis of the data sets in the CompCancer and Microarray databases, we randomly selected 50% of the observations (without replacement) corresponding to each class to form the training set. The rest of the observations were considered as test cases. For data sets from the UCR Archive, we combined the available training and test data, and randomly selected 50% of the observations from the combined set to form a new set of training observations, while keeping the proportions of observations from different classes consistent. The other half was considered as the test set. This procedure was repeated 100 times over different splits of the data set to obtain a stable estimate of the misclassification rate. |
| Hardware Specification | No | The paper discusses computational experiments and simulations but does not specify any particular hardware used for these computations. |
| Software Dependencies | No | The R packages e1071, glmnet, RSNNS and RandPro were used for SVM, GLMNET, NNET and NN-RAND, respectively. Our classifiers were implemented in R too, and the codes are available from this link. The paper mentions software by name but does not provide specific version numbers for these packages or for R itself, which is required for reproducibility. |
| Experiment Setup | Yes | In each example, we simulated data for d = 50, 100, 250, 500 and 1000. The training sample was formed by generating 50 observations from each class (except Example 4), and a test set of size 500 (250 from each class) was used. In Example 4, the training sample sizes were set to 50 and 25, respectively. This process was repeated 100 times to compute the average misclassification rates, which are reported in Figure 8. For the proposed generalized and block-generalized classifiers, we used γ(t) = 1 − e^{−t} and φ(t) = t. We used the radial basis function (RBF) kernel, i.e., Kθ(x, y) = exp{−θ‖x − y‖²}, in non-linear SVM with θ ∈ {i/10d : 1 ≤ i ≤ 20} and reported the minimum misclassification rate. For NNET, we used the sigmoid as its activation function. The number of hidden layers was allowed to vary in the set {1, 3, 5, 10}, and the minimum misclassification rate was reported for NNET. Default values were used for the other parameters of these classifiers. |
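The split-and-repeat protocol quoted in the Dataset Splits row (per-class 50% sampling without replacement, error averaged over 100 random splits) can be sketched as follows. This is a minimal illustration, not the authors' R code: the 1-nearest-neighbour classifier here is a hypothetical stand-in for the paper's classifiers.

```python
import random
from collections import defaultdict

def stratified_split(X, y, frac, rng):
    """Pick `frac` of the indices per class (without replacement) for training;
    the remaining indices form the test set, keeping class proportions."""
    by_class = defaultdict(list)
    for i, label in enumerate(y):
        by_class[label].append(i)
    train = []
    for idx in by_class.values():
        train.extend(rng.sample(idx, round(frac * len(idx))))
    train_set = set(train)
    test = [i for i in range(len(y)) if i not in train_set]
    return train, test

def one_nn(X, y, train_idx, i):
    """Hypothetical stand-in classifier: 1-nearest neighbour in squared
    Euclidean distance over the training indices."""
    j = min(train_idx, key=lambda t: sum((a - b) ** 2 for a, b in zip(X[t], X[i])))
    return y[j]

def mean_error(X, y, n_splits=100, seed=0):
    """Average misclassification rate over repeated random 50/50 splits,
    mirroring the protocol described in the paper."""
    rng = random.Random(seed)
    rates = []
    for _ in range(n_splits):
        tr, te = stratified_split(X, y, 0.5, rng)
        errs = sum(one_nn(X, y, tr, i) != y[i] for i in te)
        rates.append(errs / len(te))
    return sum(rates) / n_splits
```

Averaging over fresh random splits (rather than a single fixed split) is what the paper relies on to obtain a stable estimate of the misclassification rate.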
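The formulas quoted in the Experiment Setup row (the transformations γ and φ, the RBF kernel, and the θ grid for non-linear SVM) can also be written out directly. This is a hedged sketch of the stated formulas only, not the authors' implementation:

```python
import math

def gamma(t):
    """γ(t) = 1 − e^{−t}, the transformation used for the proposed
    generalized and block-generalized classifiers."""
    return 1.0 - math.exp(-t)

def phi(t):
    """φ(t) = t, i.e. the identity, as stated in the paper."""
    return t

def rbf_kernel(x, y, theta):
    """K_θ(x, y) = exp{−θ ‖x − y‖²}, the RBF kernel used in non-linear SVM."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-theta * sq_dist)

def theta_grid(d):
    """θ ∈ {i/(10d) : 1 ≤ i ≤ 20}; the paper reports the minimum
    misclassification rate over this grid for non-linear SVM."""
    return [i / (10 * d) for i in range(1, 21)]
```

For example, at dimension d = 100 the grid runs from θ = 0.001 to θ = 0.02 in steps of 0.001, and the kernel evaluates to 1 whenever x = y.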