Adaptively Robust and Sparse $K$-means Clustering

Authors: Hao Li, Shonosuke Sugasawa, Shota Katayama

TMLR 2024

Reproducibility assessment (Variable: Result, followed by the LLM response)
Research Type: Experimental. "Through simulation experiments and real data analysis, we demonstrate the proposed method's superiority to existing algorithms in identifying clusters without outliers and informative variables simultaneously." (Section 3, Numerical Studies)
Researcher Affiliation: Academia. Hao Li (Graduate School of Economics, Keio University), Shonosuke Sugasawa (Faculty of Economics, Keio University), Shota Katayama (Faculty of Economics, Keio University)
Pseudocode: Yes. "In order to gain a deeper comprehension of the algorithmic process we have proposed, we summarize the aforementioned procedures in pseudocode in Algorithm 1. Moreover, the computational complexity of T iterations of Algorithm 1 can be represented as O(nptTK), assuming that the number of iterations in the 4th step is O(t). We also offer pseudocode for this search procedure, for easy understanding, in Algorithm 2."
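The O(nptTK) complexity claim (each of the T outer iterations touching all n observations, p variables, and K centers) can be illustrated with a generic trimmed k-means loop. This is a simplified sketch, not the authors' Algorithm 1: it omits the sparsity weights and the adaptive outlier penalty, and the `trim_frac` parameter and deterministic initialization are illustrative assumptions.

```python
def trimmed_kmeans(X, K, trim_frac=0.1, iters=20):
    """Generic trimmed k-means sketch (NOT the paper's ARSK Algorithm 1).

    Each iteration: assign every point to its nearest center (O(npK) work),
    flag the trim_frac fraction of points with the largest distances as
    outliers, and recompute centers from the remaining points only.
    """
    n = len(X)
    # Deterministic init from the first K points (illustrative choice).
    centers = [list(X[i]) for i in range(K)]
    n_trim = int(trim_frac * n)
    for _ in range(iters):
        labels, d2 = [], []
        for x in X:
            dists = [sum((a - b) ** 2 for a, b in zip(x, c)) for c in centers]
            k = min(range(K), key=dists.__getitem__)
            labels.append(k)
            d2.append(dists[k])
        # Trim the n_trim farthest points as outliers for this iteration.
        order = sorted(range(n), key=d2.__getitem__)
        keep = set(order[: n - n_trim])
        # Update each center from its kept members only.
        for k in range(K):
            members = [X[i] for i in keep if labels[i] == k]
            if members:
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    outliers = [i for i in range(n) if i not in keep]
    return labels, outliers, centers
```

On two well-separated groups plus gross outliers, the trimming step keeps the outliers from dragging the centers, which is the core idea the robust variants build on.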
Open Source Code: Yes. "R code implementing the proposed method is available at the GitHub repository (https://github.com/lee1995hao/ARSK)."
Open Datasets: Yes. "We consider datasets from the UCI Machine Learning Repository (Dua & Graff (2019))." "The dataset was downloaded from Kaggle, a data science competition platform."
Dataset Splits: No. The paper describes data generation for simulations and uses real-world datasets, but it does not specify explicit training, validation, or test splits. It refers to a clustering error rate computed over a given set of observations, but not to how that set is partitioned.
Hardware Specification: No. The paper does not mention any specific hardware (GPU models, CPU types, memory, etc.) used for running the experiments.
Software Dependencies: No. "R code implementing the proposed method is available at the GitHub repository (https://github.com/lee1995hao/ARSK)." (The repository is referenced, but no specific package versions or software dependencies are listed.)
Experiment Setup: Yes. "In this section, we explore the ability of the proposed clustering method. We consider that each observation xi is generated independently from a multivariate normal distribution, given that the observation belongs to cluster k. Specifically, for an observation xi in cluster k, we have xi ~ N(µk,:, Σp), where µk,: ∈ R^p denotes the mean vector for cluster k, and Σp ∈ R^{p×p} represents the covariance matrix. ... In the simulation study, we consider two types of Σp. ... The tuning parameter search process is repeated 30 times for 30 different datasets, and the resulting table is presented in Table 3.2. We consider the number of permuted datasets to be 25 (i.e., B = 25). ... In this paper, we set a equal to 3.7, as recommended in Fan & Li (2001). ... We considered 3 clusters, each containing 50 observations, with the number of variables p = 50, q = 5 informative variables, and all variables independent, i.e., Σp = Ip. A proper choice of hyperparameters λ1 and λ2 should allow the model to accurately identify the clustering structure at different contamination levels π ∈ {0, 0.1, 0.2, 0.3}."
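The simulation design quoted above (K = 3 clusters of 50 observations, p = 50 independent variables of which only q = 5 are informative, Σp = Ip, and a contamination fraction π) can be sketched as a data generator. The cluster-mean values, the single-coordinate shift used to create outliers, and the `shift` magnitude are illustrative assumptions, not the paper's exact contamination scheme.

```python
import random

def simulate_clusters(K=3, n_per=50, p=50, q=5, pi=0.1, shift=25.0, seed=1):
    """Sketch of the simulation design: K clusters of n_per observations,
    p independent N(mu, 1) variables where only the first q coordinates
    carry cluster information (Sigma_p = I_p). A fraction pi of the
    observations is contaminated; the outlier mechanism here (a large
    shift in one random coordinate) is an assumed illustration.
    """
    rng = random.Random(seed)
    # Cluster means: nonzero only in the first q (informative) coordinates.
    mus = [[(k + 1) * 3.0 if j < q else 0.0 for j in range(p)] for k in range(K)]
    X, labels = [], []
    for k in range(K):
        for _ in range(n_per):
            X.append([rng.gauss(mus[k][j], 1.0) for j in range(p)])
            labels.append(k)
    # Contaminate a fraction pi of the observations.
    n = K * n_per
    out_idx = rng.sample(range(n), int(pi * n))
    for i in out_idx:
        j = rng.randrange(p)
        X[i][j] += shift  # gross error in one coordinate
    return X, labels, sorted(out_idx)
```

With the defaults this yields 150 observations in R^50 with 15 contaminated rows, matching the π = 0.1 setting in the quoted grid {0, 0.1, 0.2, 0.3}.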