On Generalizations of Some Distance Based Classifiers for HDLSS Data
Authors: Sarbojit Roy, Soham Sarkar, Subhajit Dutta, Anil K. Ghosh
JMLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | High-dimensional behavior of the proposed classifiers is studied theoretically. Numerical experiments with a variety of simulated examples as well as an extensive analysis of benchmark data sets from three different databases exhibit advantages of the proposed methods. |
| Researcher Affiliation | Academia | Sarbojit Roy EMAIL Department of Mathematics and Statistics, IIT Kanpur, Kanpur 208016, India. Soham Sarkar EMAIL Institut de Mathématiques, École Polytechnique Fédérale de Lausanne, 1015 Lausanne, Switzerland. Subhajit Dutta EMAIL Department of Mathematics and Statistics, IIT Kanpur, Kanpur 208016, India. Anil K. Ghosh EMAIL Theoretical Statistics and Mathematics Unit, Indian Statistical Institute, Kolkata 700108, India. |
| Pseudocode | No | The paper describes methods mathematically and in text, but no explicit pseudocode blocks or algorithm listings are provided. |
| Open Source Code | Yes | Our classifiers were implemented in R too, and the codes are available from this link. |
| Open Datasets | Yes | Numerical experiments with a variety of simulated examples as well as an extensive analysis of benchmark data sets from three different databases exhibit advantages of the proposed methods. The CricketX and EOGHorizontalSignal data sets are both 12-class problems from the UCR Time Series Classification Archive (see Dau et al., 2018)... The GSE2685 data set (available at the Microarray database: http://www.biolab.si/supp/bi-cancer/projections/) ... In the nutt2003v2 data set (available at the CompCancer database: https://schlieplab.org/Static/Supplements/CompCancer/datasets.htm) |
| Dataset Splits | Yes | For each example, we generated 50 observations from each class to form the training sample. Misclassification rates of different classifiers are computed based on a test set consisting of 500 (250 from each class) observations. In Example 4, the training sample sizes were set to 50 and 25, respectively. For our analysis of the data sets in the CompCancer and Microarray databases, we randomly selected 50% of the observations (without replacement) corresponding to each class to form the training set. The rest of the observations were considered as test cases. For data sets from the UCR Archive, we combined the available training and test data, and randomly selected 50% of the observations from the combined set to form a new set of training observations, while keeping the proportions of observations from different classes consistent. The other half was considered as the test set. This procedure was repeated 100 times over different splits of the data set to obtain a stable estimate of the misclassification rate. |
| Hardware Specification | No | The paper discusses computational experiments and simulations but does not specify any particular hardware used for these computations. |
| Software Dependencies | No | The R packages e1071, glmnet, RSNNS and RandPro were used for SVM, GLMNET, NNET and NN-RAND, respectively. Our classifiers were implemented in R too, and the codes are available from this link. The paper mentions software by name but does not provide specific version numbers for these packages or for R itself, which is required for reproducibility. |
| Experiment Setup | Yes | In each example, we simulated data for d = 50, 100, 250, 500 and 1000. The training sample was formed by generating 50 observations from each class (except Example 4), and a test set of size 500 (250 from each class) was used. In Example 4, the training sample sizes were set to 50 and 25, respectively. This process was repeated 100 times to compute the average misclassification rates, which are reported in Figure 8. For the proposed generalized and block-generalized classifiers, we used γ(t) = 1 − e^{−t} and φ(t) = t. We used the radial basis function (RBF) kernel, i.e., Kθ(x, y) = exp{−θ‖x − y‖²}, in non-linear SVM with θ ∈ {i/10d : 1 ≤ i ≤ 20} and reported the minimum misclassification rate. For NNET, we used the sigmoid as its activation function. The number of hidden layers was allowed to vary in the set {1, 3, 5, 10}, and the minimum misclassification rate was reported for NNET. Default values were used for the other parameters of these classifiers. |
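The split-and-repeat protocol quoted in the Dataset Splits row (per-class 50% sampling without replacement, error averaged over 100 random splits) can be sketched as follows. This is a minimal illustration, not the authors' R code: the 1-nearest-neighbour classifier here is a hypothetical stand-in for the paper's classifiers.

```python
import random
from collections import defaultdict

def stratified_split(X, y, frac, rng):
    """Pick `frac` of the indices per class (without replacement) for training;
    the remaining indices form the test set, keeping class proportions."""
    by_class = defaultdict(list)
    for i, label in enumerate(y):
        by_class[label].append(i)
    train = []
    for idx in by_class.values():
        train.extend(rng.sample(idx, round(frac * len(idx))))
    train_set = set(train)
    test = [i for i in range(len(y)) if i not in train_set]
    return train, test

def one_nn(X, y, train_idx, i):
    """Hypothetical stand-in classifier: 1-nearest neighbour in squared
    Euclidean distance over the training indices."""
    j = min(train_idx, key=lambda t: sum((a - b) ** 2 for a, b in zip(X[t], X[i])))
    return y[j]

def mean_error(X, y, n_splits=100, seed=0):
    """Average misclassification rate over repeated random 50/50 splits,
    mirroring the protocol described in the paper."""
    rng = random.Random(seed)
    rates = []
    for _ in range(n_splits):
        tr, te = stratified_split(X, y, 0.5, rng)
        errs = sum(one_nn(X, y, tr, i) != y[i] for i in te)
        rates.append(errs / len(te))
    return sum(rates) / n_splits
```

Averaging over fresh random splits (rather than a single fixed split) is what the paper relies on to obtain a stable estimate of the misclassification rate.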
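The formulas quoted in the Experiment Setup row (the transformations γ and φ, the RBF kernel, and the θ grid for non-linear SVM) can also be written out directly. This is a hedged sketch of the stated formulas only, not the authors' implementation:

```python
import math

def gamma(t):
    """γ(t) = 1 − e^{−t}, the transformation used for the proposed
    generalized and block-generalized classifiers."""
    return 1.0 - math.exp(-t)

def phi(t):
    """φ(t) = t, i.e. the identity, as stated in the paper."""
    return t

def rbf_kernel(x, y, theta):
    """K_θ(x, y) = exp{−θ ‖x − y‖²}, the RBF kernel used in non-linear SVM."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-theta * sq_dist)

def theta_grid(d):
    """θ ∈ {i/(10d) : 1 ≤ i ≤ 20}; the paper reports the minimum
    misclassification rate over this grid for non-linear SVM."""
    return [i / (10 * d) for i in range(1, 21)]
```

For example, at dimension d = 100 the grid runs from θ = 0.001 to θ = 0.02 in steps of 0.001, and the kernel evaluates to 1 whenever x = y.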