A New Algorithm and Theory for Penalized Regression-based Clustering

Authors: Chong Wu, Sunghoon Kwon, Xiaotong Shen, Wei Pan

JMLR 2016

Reproducibility variables (each listed as Variable: Result, followed by the supporting LLM response):
Research Type: Experimental
  Numerically, we compare the DC-ADMM algorithm with the quadratic penalty algorithm to demonstrate its utility and scalability. Theoretically, we establish a finite-sample mis-clustering error bound for penalized regression based clustering with the L0 constrained regularization in a general setting. ... A simulation study is then performed to demonstrate the numerical performance of the new algorithm as compared to other methods.
Researcher Affiliation: Academia
  Division of Biostatistics, University of Minnesota, Minneapolis, MN 55455, USA; Department of Applied Statistics, Konkuk University, Seoul, South Korea; School of Statistics, University of Minnesota, Minneapolis, MN 55455, USA
Pseudocode: Yes
  Algorithm 1: DC-ADMM for penalized regression based clustering
  Input: n observations X = {x_1, . . . , x_n}; tuning parameters λ, τ and ρ.
  1 Initialize: Set m = 0, û_ij^(0) = 0, µ̂_i^(0) = x_i and θ̂_ij^(0) = x_i − x_j for 1 ≤ i < j ≤ n.
  2 while m = 0 or S(µ̂^(m), θ̂^(m)) − S(µ̂^(m−1), θ̂^(m−1)) < 0 do
  3   m = m + 1
  4   Update µ̂^(m) and θ̂^(m) based on (5) until convergence with a standard ADMM.
  5 end
  Output: Estimated centroids for the observations, µ̂_1, . . . , µ̂_n, from which a cluster label for each observation is assigned.
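The control flow of Algorithm 1 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' R implementation: the inner standard-ADMM solve of step 4 is replaced by a plain subgradient step, the truncated-L1 grouping penalty λ·min(‖µ_i − µ_j‖, τ) is one of several penalties the package supports, and the function name `prclust_sketch` and the step size `lr` are invented for this example. What it does preserve is the outer structure: centroids start at the observations, and the loop stops once the objective S no longer decreases.

```python
import numpy as np

def prclust_sketch(X, lam=0.5, tau=0.3, n_outer=50, lr=0.1):
    """Outer loop of a PRclust-style solver (illustrative only)."""
    n = len(X)
    mu = X.copy()  # step 1: each centroid initialized at its observation

    def objective(mu):
        fit = 0.5 * np.sum((X - mu) ** 2)
        pen = 0.0
        for i in range(n):
            for j in range(i + 1, n):
                d = np.linalg.norm(mu[i] - mu[j])
                pen += lam * min(d, tau)  # truncated-L1 grouping penalty (assumed)
        return fit + pen

    prev = np.inf
    for _ in range(n_outer):
        # Stand-in for the inner ADMM solve (step 4): one subgradient step.
        grad = mu - X
        for i in range(n):
            for j in range(i + 1, n):
                diff = mu[i] - mu[j]
                d = np.linalg.norm(diff)
                if 1e-12 < d < tau:  # penalty is flat beyond the truncation radius
                    g = lam * diff / d
                    grad[i] += g
                    grad[j] -= g
        mu = mu - lr * grad
        cur = objective(mu)
        if cur >= prev:  # step 2: stop once the objective stops decreasing
            break
        prev = cur
    return mu

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
mu_hat = prclust_sketch(X)  # estimated centroids, one per observation
```

Observations whose estimated centroids coincide would then share a cluster label, as in the algorithm's output step.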
Open Source Code: Yes
  As an end product, we put R package prclust implementing PRclust with various loss and grouping penalty functions available on GitHub and CRAN. ... As a by-product of this new method, we make R package prclust implementing both the quadratic penalty based algorithm and DC-ADMM available in CRAN (https://cran.r-project.org) and GitHub (https://github.com/ChongWu-Biostat/prclust).
Open Datasets: No
  Consider two overlapped convex clusters with the same spherical shape in two dimensions. Specifically, a random sample of n = 100 observations was generated, with 50 from a bivariate Gaussian distribution N((0, 0)^T, 0.33I), while the other 50 from N((1, 1)^T, 0.33I), where I is the identity matrix.
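Because the data are simulated rather than drawn from a public dataset, the setting is easy to regenerate. A minimal sketch of the simulation described above (the seed is arbitrary; the paper does not report one):

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, not from the paper
n_per = 50
cov = 0.33 * np.eye(2)  # spherical covariance 0.33 * I

# Two overlapping bivariate Gaussian clusters in two dimensions.
cluster1 = rng.multivariate_normal([0.0, 0.0], cov, size=n_per)
cluster2 = rng.multivariate_normal([1.0, 1.0], cov, size=n_per)
X = np.vstack([cluster1, cluster2])  # n = 100 observations
```

With centers (0, 0) and (1, 1) and per-coordinate variance 0.33, the two clusters overlap substantially, which is the point of the example.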
Dataset Splits: Yes
  That is, (1) randomly partition the entire data set into a training set and a test set with an almost equal size; (2) cluster the training and test sets separately via PRclust with the same tuning parameters; (3) measure how well the training set clusters predict the test clusters. To be specific, first, randomly partition the entire data set into a training set Xtr and a test set Xte with a roughly equal size.
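The three-step stability procedure can be sketched as follows. This is an illustration of the split-cluster-predict idea, not the paper's exact criterion: the clustering routine is passed in as a generic `cluster_fn` (here a simple threshold rule stands in for PRclust), test labels are predicted by nearest training centroid, and agreement is measured Rand-index style over pairs; all names are invented for this example.

```python
import numpy as np

def prediction_agreement(X, cluster_fn, rng):
    """Split-based stability: cluster train/test halves separately with the
    same tuning, then check how well the training clustering predicts the
    test-side clustering (pairwise agreement, Rand-index style)."""
    n = len(X)
    perm = rng.permutation(n)
    half = n // 2  # step (1): roughly equal-size training and test sets
    X_tr, X_te = X[perm[:half]], X[perm[half:]]
    lab_tr = cluster_fn(X_tr)  # step (2): cluster each half separately,
    lab_te = cluster_fn(X_te)  # with the same tuning parameters
    # Step (3): predict test labels by nearest training-cluster centroid.
    centroids = np.array([X_tr[lab_tr == k].mean(axis=0)
                          for k in np.unique(lab_tr)])
    pred = np.argmin(((X_te[:, None, :] - centroids[None]) ** 2).sum(-1), axis=1)
    # Pairwise agreement between predicted and test-side partitions
    # (label identities do not matter, only co-membership).
    same_pred = pred[:, None] == pred[None, :]
    same_te = lab_te[:, None] == lab_te[None, :]
    iu = np.triu_indices(len(X_te), k=1)
    return (same_pred[iu] == same_te[iu]).mean()

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.3, (50, 2)), rng.normal(2.0, 0.3, (50, 2))])
split_labels = lambda Z: (Z[:, 0] > 1.0).astype(int)  # toy stand-in for PRclust
score = prediction_agreement(X, split_labels, rng)  # close to 1 here
```

A score near 1 indicates that the tuning parameters produce clusterings that generalize across splits, which is how the procedure is used for tuning-parameter selection.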
Hardware Specification: No
  The paper reports run-time comparisons of the algorithms but does not specify the hardware (e.g., CPU or GPU models, memory) used to run the experiments or obtain the timing measurements.
Software Dependencies: No
  The paper mentions the R package prclust but does not specify the version of R or the versions of any other software libraries.
Experiment Setup: Yes
  For PRclust, we searched τ ∈ {0.1, 0.2, . . . , 1} and λ ∈ {0.01, 0.05, 0.1, 0.2, 0.3, 0.5, 0.7, 1, 1.5, 2}. ... In this paper, we fix ρ = 0.4 throughout for simplicity.
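The reported setup amounts to a 10 × 10 grid search over (τ, λ) with ρ held fixed. A minimal sketch, where `stability_score` is a hypothetical stand-in for running PRclust at each setting and scoring it with a cross-split stability criterion:

```python
import itertools

# Grids from the paper's experiment setup; rho is fixed at 0.4 throughout.
taus = [round(0.1 * k, 1) for k in range(1, 11)]           # tau in {0.1, ..., 1}
lams = [0.01, 0.05, 0.1, 0.2, 0.3, 0.5, 0.7, 1, 1.5, 2]
rho = 0.4

def stability_score(tau, lam):
    # Hypothetical placeholder: in practice this would fit PRclust with
    # (tau, lam, rho) and return a split-based stability criterion.
    return -(tau - 0.3) ** 2 - (lam - 0.5) ** 2

grid = list(itertools.product(taus, lams))  # 10 x 10 = 100 settings
best_tau, best_lam = max(grid, key=lambda tl: stability_score(*tl))
```

The selected (τ, λ) pair is whichever setting maximizes the criterion over the 100 candidates.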