Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Regularized K-means Through Hard-Thresholding

Authors: Jakob Raymaekers, Ruben H. Zamar

JMLR 2022

Reproducibility Variable Result LLM Response
Research Type Experimental We conduct a careful theoretical and empirical study of the regularized K-means approach and build a general regularization framework based on direct penalization of the size of the cluster centers, in which we consider lasso, ridge, group-lasso and ℓ0-type penalties. We present a general iterative algorithm with a sparse starting value for the estimation of the cluster centers for each of these penalties, which gives insight into the effect of the different penalties on the estimated cluster centers. Based on this study, we propose the use of the ℓ0 penalty, which results in our proposal of the hard-thresholding K-means algorithm (HTK-means). The complementary empirical studies indicate that: 1. the ℓ0 penalty of HTK-means generally outperforms the lasso, ridge and group-lasso penalties in terms of cluster recovery and speed of computation (Section 5.2); 2. the absence of shrinkage in the estimated cluster centers allows for intuitive visualizations (Section 6).
Researcher Affiliation Academia Jakob Raymaekers (EMAIL), Department of Quantitative Economics, Maastricht University, Maastricht, The Netherlands; Department of Mathematics, KU Leuven, Leuven, Belgium. Ruben H. Zamar (EMAIL), Department of Statistics, University of British Columbia, Earth Sciences Building, 2207 Main Mall, Vancouver, Canada.
Pseudocode No The paper describes the iterative algorithm verbally: 'Given an initial set of cluster centers: 1. Update the cluster indices Ĉ by minimizing Equation 1 with respect to the cluster indices while keeping the cluster centers fixed. 2. Update the cluster centers µ̂ by minimizing Equation 2 with respect to the cluster centers while keeping the cluster indices fixed. 3. Repeat 1. and 2. until convergence.' This is not presented as a structured pseudocode block.
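The alternating scheme quoted above is a Lloyd-style iteration with an extra thresholding step applied to the cluster centers. As a rough illustration only (not the authors' implementation, which ships in the clusterHD R package), the steps could be sketched in Python as follows; `htk_means_sketch` and `hard_threshold` are hypothetical names, the plain initialization and the simple entrywise thresholding rule are assumptions for brevity, not the paper's exact sparse start or ℓ0 update:

```python
import numpy as np

def hard_threshold(centers, lam):
    # Illustrative l0-style step: set small center entries exactly to zero.
    # (The paper's exact thresholding rule may differ.)
    out = centers.copy()
    out[np.abs(out) < lam] = 0.0
    return out

def htk_means_sketch(X, K, lam, n_iter=100):
    # Plain initialization for brevity; the paper advocates a sparse starting value.
    centers = X[:K].copy()
    labels = np.full(len(X), -1)
    for _ in range(n_iter):
        # 1. Update cluster indices with the centers kept fixed.
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_labels = d.argmin(axis=1)
        # 2. Update centers (cluster means), then hard-threshold them.
        for k in range(K):
            if np.any(new_labels == k):
                centers[k] = X[new_labels == k].mean(axis=0)
        centers = hard_threshold(centers, lam)
        # 3. Repeat until the assignment stops changing.
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels, centers
```

Because the thresholding sets center entries exactly to zero rather than shrinking them toward zero, the surviving entries stay at their cluster-mean values, which is what underlies the paper's point about intuitive, shrinkage-free visualizations.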
Open Source Code Yes The code for HTK-means and the proposed diagnostic plots is included in the R package clusterHD (Raymaekers and Zamar, 2022), available on CRAN.
Open Datasets Yes To illustrate this, we consider the classical example of Fisher's Iris data (Fisher, 1936), collected by Anderson (1935). The data consists of 150 iris flowers which are described by 4 variables characterizing the dimensions of their sepal and petal. The flowers can be subdivided in 50 samples of each of three types of iris: Iris setosa, versicolor, and virginica. ... As a second example we consider the banknote data set, which consists of six measurements for 100 genuine and 100 counterfeit old-Swiss 1000-franc bank notes. The data was analyzed by Flury and Riedwyl (1988) and is publicly available in the R-package mclust (Scrucca et al., 2016). ... We analyze the gene expression data publicly available in the R-package antiProfilesData (Bravo et al., 2020), which contains samples of normal colon tissue and colon cancer tissue collected from the Gene Expression Omnibus (Edgar et al., 2002; Barrett et al., 2012).
Dataset Splits Yes Wang (2010) propose to repeatedly split the data into 3 parts, 2 training sets and one validation set. ... 1. stab1: Fang and Wang (2012) propose instead to use bootstrap samples by taking, for each replication, 2 bootstrap data sets of size n, after which the original data is used as validation set. 2. stab2: Ben-Hur et al. (2001); Haslbeck and Wulff (2020) take the intersection of unique samples of the 2 bootstrapped training data sets as validation set. 3. stab3: Sun et al. (2012) use a variation where a third bootstrap data set is taken as validation set.
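For illustration only, one replication of a stab1-style scheme (two bootstrap training sets, the original data as validation set) could be sketched as follows in Python; `stability_replication` and the `fit_predict` callable are hypothetical names, and the pairwise co-assignment agreement used here is just one simple, label-permutation-invariant choice, not necessarily the measure used in the cited works:

```python
import numpy as np

def stability_replication(X, fit_predict, seed=0):
    """One replication of a stab1-style bootstrap stability scheme:
    cluster two bootstrap samples of size n, then compare the two
    resulting clusterings on the original data as validation set."""
    rng = np.random.default_rng(seed)
    n = len(X)
    labelings = []
    for _ in range(2):
        # Bootstrap training set of size n (sampling with replacement).
        boot = X[rng.choice(n, n, replace=True)]
        centers = fit_predict(boot)  # hypothetical: returns cluster centers
        # Assign every original observation to its nearest fitted center.
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labelings.append(d.argmin(axis=1))
    a, b = labelings
    # Fraction of point pairs on which the two clusterings agree
    # about being in the same cluster (invariant to label permutations).
    same_a = a[:, None] == a[None, :]
    same_b = b[:, None] == b[None, :]
    return (same_a == same_b).mean()
```

Averaging this agreement over many replications, for each candidate penalty level, gives a stability curve from which the tuning parameter can be chosen.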
Hardware Specification No The paper does not explicitly describe any specific hardware (e.g., GPU, CPU models) used for running the experiments. It mentions 'computation times' and 'highly optimized C code' vs 'pure R code' but no hardware details.
Software Dependencies No The paper mentions several R packages like 'R package clusterHD (Raymaekers and Zamar, 2022)', 'R-package sparcl by Witten and Tibshirani (2018)', 'R-package mclust (Scrucca et al., 2016)', and 'R-package antiProfilesData (Bravo et al., 2020)'. However, it does not provide specific version numbers for these packages or for R itself, which is required for a reproducible description of software dependencies.
Experiment Setup Yes The data generation process starts from the approach of Sun et al. (2012) and extends this in several directions. We generate data sets of n ∈ {80, 800} observations x1, . . . , xn in p ∈ {50, 200, 500, 1000} dimensions. First the true cluster assignment vector y = (y1, . . . , yn) is sampled from {1, . . . , K} where K ∈ {2, 4, 8}. ... We vary the value of γ ∈ {0.4, 0.5, 0.6, 0.7, 0.8}. ... Each of the methods is calculated on a grid of 40 lambda values given by 10^(-2 + 4i/40), for i = 0, 1, . . . , 39, after which the best solution is retained. ... For the stability based methods, 20 replications of the resampling strategy were used.
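The quoted 40-value lambda grid is straightforward to reproduce; a small Python check (the variable name `lams` is illustrative):

```python
import numpy as np

# 40-value grid from the setup: 10^(-2 + 4i/40) for i = 0, ..., 39,
# i.e. log-equally spaced from 10^-2 up to (but excluding) 10^2.
lams = 10.0 ** (-2 + 4 * np.arange(40) / 40)

# Equivalent formulation via logspace (endpoint 10^2 excluded):
assert np.allclose(lams, np.logspace(-2, 2, num=40, endpoint=False))
```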