Multi-scale Classification using Localized Spatial Depth
Authors: Subhajit Dutta, Soham Sarkar, Anil K. Ghosh
JMLR 2016
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The proposed classifier can be conveniently used even when the dimension of the data is larger than the sample size, and its good discriminatory power for such data has been established using theoretical as well as numerical results. In Sections 5 and 6, some simulated and benchmark data sets are analyzed to demonstrate the usefulness of these proposed classifiers. |
| Researcher Affiliation | Academia | Subhajit Dutta EMAIL Department of Mathematics and Statistics Indian Institute of Technology Kanpur 208016, India. Soham Sarkar EMAIL Anil K. Ghosh EMAIL Theoretical Statistics and Mathematics Unit Indian Statistical Institute 203, B. T. Road, Kolkata 700108, India. |
| Pseudocode | No | No structured pseudocode or algorithm blocks are explicitly present in the paper. The methodology is described using mathematical formulations and textual descriptions. |
| Open Source Code | Yes | For the classifiers based on SPD and LSPD, we wrote our own R codes and they are available at the link goo.gl/E5tmd6. |
| Open Datasets | Yes | The biomedical data set is taken from the CMU data archive (http://lib.stat.cmu.edu/datasets/). The diabetes data set is available in the R library mclust (also analyzed in Reaven and Miller, 1979). All other data are taken from the UCI machine learning repository (http://archive.ics.uci.edu/ml/), except for the lightning-2 data and the colon data (Alon et al., 1999): the lightning-2 data set is from the UCR time series classification archive (http://www.cs.ucr.edu/~eamonn/time_series_data/), while the colon data set is taken from the R library rda. |
| Dataset Splits | Yes | In each example, taking an equal number of observations from each of the two competing classes, we generated training and test sets of sizes 200 and 500, respectively. This procedure was repeated 500 times, and the average test set misclassification rates of different classifiers are reported in Tables 1 and 2 along with their corresponding standard errors. Throughout this article, we have used 50 different values of h for multi-scale classification based on LSPD, and the weight function is computed using a 5-fold cross-validation method. |
| Hardware Specification | Yes | All the calculations were done on a desktop computer with an Intel i7 (2.2 GHz) processor having 8 GB RAM. |
| Software Dependencies | Yes | We used the codes available at the R library e1071 (Dimitriadou et al., 2011). For the implementation of TREE and RF, we used the R codes available in the libraries tree (Ripley, 2011) and randomForest (Liaw and Wiener, 2002), respectively. For the maximum LD classifier, we used the R library DepthProc (Kosiorowski and Zawadzki, 2016), and the R library VGAM (Yee, 2008) was used to fit GAM. |
| Experiment Setup | Yes | For the multi-scale method based on KDE, we have considered 50 equi-spaced values of the bandwidth in the range suggested by Ghosh et al. (2006). For the multi-scale version of k-NN, we considered all possible values of k (see Ghosh et al., 2005, for more details). For the RBF kernel, it has been suggested in the literature to use γ = 1/d (see http://www.csie.ntu.edu.tw/~cjlin/libsvm/). However, for our numerical work, we considered γ = i/10d for 1 ≤ i ≤ 50. We also used 25 different values for the box constraint in the interval [0.1, 100], which were equi-spaced in the logarithmic scale. For classification tree, the deviance function was used as a measure of impurity, and the maximum height of the tree was restricted to 31. Nodes with less than 5 observations were never considered for splitting. We have combined the results of 500 trees in RF, where each tree was generated based on 63.2% randomly chosen observations from the training sample. At any stage, only a random subset of √d out of d variables were considered for splitting. |
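The SVM tuning grid quoted above (γ = i/10d for 1 ≤ i ≤ 50, plus 25 box-constraint values equi-spaced on a log scale over [0.1, 100]) can be reconstructed directly. This is a minimal sketch of that grid in Python, not the authors' R code; the dimension `d = 10` is a hypothetical example value.

```python
import math

d = 10  # hypothetical data dimension, for illustration only

# Kernel-width grid: gamma = i / (10 * d) for i = 1, ..., 50
gammas = [i / (10 * d) for i in range(1, 51)]

# 25 box-constraint (cost) values, equi-spaced on the log10 scale over [0.1, 100]
lo, hi, m = math.log10(0.1), math.log10(100), 25
box_constraints = [10 ** (lo + i * (hi - lo) / (m - 1)) for i in range(m)]

print(len(gammas), gammas[0], gammas[-1])
print(len(box_constraints), box_constraints[0], box_constraints[-1])
```

The full search then evaluates all 50 × 25 = 1250 (γ, C) pairs; spacing the cost values logarithmically covers both weakly and strongly regularized fits with few grid points.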
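The weight function in the Dataset Splits row is tuned by 5-fold cross-validation on the training sample of size 200. A generic sketch of such a fold assignment, using only the Python standard library (the paper's own R implementation is not reproduced here):

```python
import random

n_train = 200      # training-set size reported in the paper
n_folds = 5        # 5-fold cross-validation for the weight function

# Shuffle indices once, then deal them round-robin into 5 folds
rng = random.Random(0)  # fixed seed, for reproducibility of this sketch
indices = list(range(n_train))
rng.shuffle(indices)
folds = [indices[i::n_folds] for i in range(n_folds)]

# Each fold serves once as the validation set, the rest as training
for k, held_out in enumerate(folds):
    train_idx = [i for f in folds if f is not held_out for i in f]
    # ... fit the classifier on train_idx, score on held_out ...
```

With 200 observations and 5 folds, each validation fold holds exactly 40 observations and each fit uses the remaining 160.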