Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Regularized K-means Through Hard-Thresholding

Authors: Jakob Raymaekers, Ruben H. Zamar

JMLR 2022

Reproducibility Variable Result LLM Response
Research Type Experimental We conduct a careful theoretical and empirical study of the regularized K-means approach and build a general regularization framework based on direct penalization of the size of the cluster centers, in which we consider lasso, ridge, group-lasso and ℓ0-type penalties. We present a general iterative algorithm with a sparse starting value for the estimation of the cluster centers for each of these penalties, which gives insight into the effect of the different penalties on the estimated cluster centers. Based on this study, we propose the use of the ℓ0 penalty, which results in our proposal of the hard-thresholding K-means algorithm (HTK-means). The complementary empirical studies indicate that: 1. the ℓ0 penalty of HTK-means generally outperforms the lasso, ridge and group-lasso penalties in terms of cluster recovery and speed of computation (Section 5.2); 2. the absence of shrinkage in the estimated cluster centers allows for intuitive visualizations (Section 6).
Researcher Affiliation Academia Jakob Raymaekers (EMAIL), Department of Quantitative Economics, Maastricht University, Maastricht, The Netherlands; Department of Mathematics, KU Leuven, Leuven, Belgium. Ruben H. Zamar (EMAIL), Department of Statistics, University of British Columbia, Earth Sciences Building, 2207 Main Mall, Vancouver, Canada.
Pseudocode No The paper describes the iterative algorithm verbally: 'Given an initial set of cluster centers: 1. Update the cluster indices Ĉ by minimizing Equation 1 with respect to the cluster indices while keeping the cluster centers fixed. 2. Update the cluster centers µ̂ by minimizing Equation 2 with respect to the cluster centers while keeping the cluster indices fixed. 3. Repeat 1. and 2. until convergence.' This is not presented as a structured pseudocode block.
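The alternating scheme quoted above is a Lloyd-style iteration with an extra thresholding step applied to the cluster centers. As a rough illustration only (not the authors' implementation, which ships in the clusterHD R package), the steps could be sketched in Python as follows; `htk_means_sketch` and `hard_threshold` are hypothetical names, the plain initialization and the simple entrywise thresholding rule are assumptions for brevity, not the paper's exact sparse start or ℓ0 update:

```python
import numpy as np

def hard_threshold(centers, lam):
    # Illustrative l0-style step: set small center entries exactly to zero.
    # (The paper's exact thresholding rule may differ.)
    out = centers.copy()
    out[np.abs(out) < lam] = 0.0
    return out

def htk_means_sketch(X, K, lam, n_iter=100):
    # Plain initialization for brevity; the paper advocates a sparse starting value.
    centers = X[:K].copy()
    labels = np.full(len(X), -1)
    for _ in range(n_iter):
        # 1. Update cluster indices with the centers kept fixed.
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_labels = d.argmin(axis=1)
        # 2. Update centers (cluster means), then hard-threshold them.
        for k in range(K):
            if np.any(new_labels == k):
                centers[k] = X[new_labels == k].mean(axis=0)
        centers = hard_threshold(centers, lam)
        # 3. Repeat until the assignment stops changing.
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels, centers
```

Because the thresholding sets center entries exactly to zero rather than shrinking them toward zero, the surviving entries stay at their cluster-mean values, which is what underlies the paper's point about intuitive, shrinkage-free visualizations.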
Open Source Code Yes The code for HTK-means and the proposed diagnostic plots is included in the R package clusterHD (Raymaekers and Zamar, 2022), available on CRAN.
Open Datasets Yes To illustrate this, we consider the classical example of Fisher's Iris data (Fisher, 1936), collected by Anderson (1935). The data consists of 150 iris flowers which are described by 4 variables characterizing the dimensions of their sepal and petal. The flowers can be subdivided in 50 samples of each of three types of iris: Iris setosa, versicolor, and virginica. ... As a second example we consider the banknote data set, which consists of six measurements for 100 genuine and 100 counterfeit old-Swiss 1000-franc bank notes. The data was analyzed by Flury and Riedwyl (1988) and is publicly available in the R-package mclust (Scrucca et al., 2016). ... We analyze the gene expression data publicly available in the R-package antiProfilesData (Bravo et al., 2020), which contains samples of normal colon tissue and colon cancer tissue collected from the Gene Expression Omnibus (Edgar et al., 2002; Barrett et al., 2012).
Dataset Splits Yes Wang (2010) propose to repeatedly split the data into 3 parts, 2 training sets and one validation set. ... 1. stab1: Fang and Wang (2012) propose instead to use bootstrap samples by taking, for each replication, 2 bootstrap data sets of size n, after which the original data is used as validation set. 2. stab2: Ben-Hur et al. (2001); Haslbeck and Wulff (2020) take the intersection of unique samples of the 2 bootstrapped training data sets as validation set. 3. stab3: Sun et al. (2012) use a variation where a third bootstrap data set is taken as validation set.
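For illustration only, one replication of a stab1-style scheme (two bootstrap training sets, the original data as validation set) could be sketched as follows in Python; `stability_replication` and the `fit_predict` callable are hypothetical names, and the pairwise co-assignment agreement used here is just one simple, label-permutation-invariant choice, not necessarily the measure used in the cited works:

```python
import numpy as np

def stability_replication(X, fit_predict, seed=0):
    """One replication of a stab1-style bootstrap stability scheme:
    cluster two bootstrap samples of size n, then compare the two
    resulting clusterings on the original data as validation set."""
    rng = np.random.default_rng(seed)
    n = len(X)
    labelings = []
    for _ in range(2):
        # Bootstrap training set of size n (sampling with replacement).
        boot = X[rng.choice(n, n, replace=True)]
        centers = fit_predict(boot)  # hypothetical: returns cluster centers
        # Assign every original observation to its nearest fitted center.
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labelings.append(d.argmin(axis=1))
    a, b = labelings
    # Fraction of point pairs on which the two clusterings agree
    # about being in the same cluster (invariant to label permutations).
    same_a = a[:, None] == a[None, :]
    same_b = b[:, None] == b[None, :]
    return (same_a == same_b).mean()
```

Averaging this agreement over many replications, for each candidate penalty level, gives a stability curve from which the tuning parameter can be chosen.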
Hardware Specification No The paper does not explicitly describe any specific hardware (e.g., GPU, CPU models) used for running the experiments. It mentions 'computation times' and 'highly optimized C code' vs 'pure R code' but no hardware details.
Software Dependencies No The paper mentions several R packages like 'R package clusterHD (Raymaekers and Zamar, 2022)', 'R-package sparcl by Witten and Tibshirani (2018)', 'R-package mclust (Scrucca et al., 2016)', and 'R-package antiProfilesData (Bravo et al., 2020)'. However, it does not provide specific version numbers for these packages or for R itself, which is required for a reproducible description of software dependencies.
Experiment Setup Yes The data generation process starts from the approach of Sun et al. (2012) and extends this in several directions. We generate data sets of n ∈ {80, 800} observations x1, . . . , xn in p ∈ {50, 200, 500, 1000} dimensions. First the true cluster assignment vector y = (y1, . . . , yn) is sampled from {1, . . . , K} where K ∈ {2, 4, 8}. ... We vary the value of γ ∈ {0.4, 0.5, 0.6, 0.7, 0.8}. ... Each of the methods is calculated on a grid of 40 lambda values given by 10^(-2 + 4i/40), for i = 0, 1, . . . , 39, after which the best solution is retained. ... For the stability based methods, 20 replications of the resampling strategy were used.
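The quoted 40-value lambda grid is straightforward to reproduce; a small Python check (the variable name `lams` is illustrative):

```python
import numpy as np

# 40-value grid from the setup: 10^(-2 + 4i/40) for i = 0, ..., 39,
# i.e. log-equally spaced from 10^-2 up to (but excluding) 10^2.
lams = 10.0 ** (-2 + 4 * np.arange(40) / 40)

# Equivalent formulation via logspace (endpoint 10^2 excluded):
assert np.allclose(lams, np.logspace(-2, 2, num=40, endpoint=False))
```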