Adaptively Robust and Sparse $K$-means Clustering

Authors: Hao Li, Shonosuke Sugasawa, Shota Katayama

TMLR 2024

Reproducibility assessment (Variable: Result, followed by the LLM response)
Research Type: Experimental. "Through simulation experiments and real data analysis, we demonstrate the proposed method's superiority to existing algorithms in identifying clusters without outliers and informative variables simultaneously." (Section 3, Numerical Studies)
Researcher Affiliation: Academia. Hao Li (Graduate School of Economics, Keio University), Shonosuke Sugasawa (Faculty of Economics, Keio University), Shota Katayama (Faculty of Economics, Keio University)
Pseudocode: Yes. "In order to gain a deeper comprehension of the algorithmic process we have proposed, we summarize the aforementioned procedures in pseudocode in Algorithm 1. Moreover, the computational complexity of T iterations of Algorithm 1 can be represented as O(nptTK), assuming that the number of iterations in the 4th step is O(t). We also offer pseudocode for this search procedure, for easy understanding, in Algorithm 2."
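The O(nptTK) complexity claim (each of the T outer iterations touching all n observations, p variables, and K centers) can be illustrated with a generic trimmed k-means loop. This is a simplified sketch, not the authors' Algorithm 1: it omits the sparsity weights and the adaptive outlier penalty, and the `trim_frac` parameter and deterministic initialization are illustrative assumptions.

```python
def trimmed_kmeans(X, K, trim_frac=0.1, iters=20):
    """Generic trimmed k-means sketch (NOT the paper's ARSK Algorithm 1).

    Each iteration: assign every point to its nearest center (O(npK) work),
    flag the trim_frac fraction of points with the largest distances as
    outliers, and recompute centers from the remaining points only.
    """
    n = len(X)
    # Deterministic init from the first K points (illustrative choice).
    centers = [list(X[i]) for i in range(K)]
    n_trim = int(trim_frac * n)
    for _ in range(iters):
        labels, d2 = [], []
        for x in X:
            dists = [sum((a - b) ** 2 for a, b in zip(x, c)) for c in centers]
            k = min(range(K), key=dists.__getitem__)
            labels.append(k)
            d2.append(dists[k])
        # Trim the n_trim farthest points as outliers for this iteration.
        order = sorted(range(n), key=d2.__getitem__)
        keep = set(order[: n - n_trim])
        # Update each center from its kept members only.
        for k in range(K):
            members = [X[i] for i in keep if labels[i] == k]
            if members:
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    outliers = [i for i in range(n) if i not in keep]
    return labels, outliers, centers
```

On two well-separated groups plus gross outliers, the trimming step keeps the outliers from dragging the centers, which is the core idea the robust variants build on.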
Open Source Code: Yes. "R code implementing the proposed method is available at the GitHub repository (https://github.com/lee1995hao/ARSK)."
Open Datasets: Yes. "We consider datasets from the UCI Machine Learning Repository (Dua & Graff (2019))." "The dataset was downloaded from Kaggle, a data science competition platform."
Dataset Splits: No. The paper describes data generation for simulations and uses real-world datasets, but it does not specify explicit training, validation, or test splits. It refers to a clustering error rate computed over a given set of observations, but not to how that set is partitioned.
Hardware Specification: No. The paper does not mention any specific hardware (GPU models, CPU types, memory, etc.) used for running the experiments.
Software Dependencies: No. "R code implementing the proposed method is available at the GitHub repository (https://github.com/lee1995hao/ARSK)." (The repository is referenced, but no specific package versions or software dependencies are listed.)
Experiment Setup: Yes. "In this section, we explore the ability of the proposed clustering method. We consider that each observation xi is generated independently from a multivariate normal distribution, given that the observation belongs to cluster k. Specifically, for an observation xi in cluster k, we have xi ~ N(µk,:, Σp), where µk,: ∈ R^p denotes the mean vector for cluster k, and Σp ∈ R^{p×p} represents the covariance matrix. ... In the simulation study, we consider two types of Σp. ... The tuning parameter search process is repeated 30 times for 30 different datasets, and the resulting table is presented in Table 3.2. We consider the number of permuted datasets to be 25 (i.e., B = 25). ... In this paper, we set a equal to 3.7, as recommended in Fan & Li (2001). ... We considered 3 clusters, each containing 50 observations, with the number of variables p = 50, q = 5 informative variables, and all variables independent, i.e., Σp = Ip. A proper choice of hyperparameters λ1 and λ2 should allow the model to accurately identify the clustering structure at different contamination levels π ∈ {0, 0.1, 0.2, 0.3}."
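The simulation design quoted above (K = 3 clusters of 50 observations, p = 50 independent variables of which only q = 5 are informative, Σp = Ip, and a contamination fraction π) can be sketched as a data generator. The cluster-mean values, the single-coordinate shift used to create outliers, and the `shift` magnitude are illustrative assumptions, not the paper's exact contamination scheme.

```python
import random

def simulate_clusters(K=3, n_per=50, p=50, q=5, pi=0.1, shift=25.0, seed=1):
    """Sketch of the simulation design: K clusters of n_per observations,
    p independent N(mu, 1) variables where only the first q coordinates
    carry cluster information (Sigma_p = I_p). A fraction pi of the
    observations is contaminated; the outlier mechanism here (a large
    shift in one random coordinate) is an assumed illustration.
    """
    rng = random.Random(seed)
    # Cluster means: nonzero only in the first q (informative) coordinates.
    mus = [[(k + 1) * 3.0 if j < q else 0.0 for j in range(p)] for k in range(K)]
    X, labels = [], []
    for k in range(K):
        for _ in range(n_per):
            X.append([rng.gauss(mus[k][j], 1.0) for j in range(p)])
            labels.append(k)
    # Contaminate a fraction pi of the observations.
    n = K * n_per
    out_idx = rng.sample(range(n), int(pi * n))
    for i in out_idx:
        j = rng.randrange(p)
        X[i][j] += shift  # gross error in one coordinate
    return X, labels, sorted(out_idx)
```

With the defaults this yields 150 observations in R^50 with 15 contaminated rows, matching the π = 0.1 setting in the quoted grid {0, 0.1, 0.2, 0.3}.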