A note on the $k$-means clustering for missing data

Authors: Yoshikazu Terada, Xin Guan

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we illustrate some numerical simulations to verify the inconsistency of k-POD. [...] Figure 3 shows that the MSE of k-means with complete cases (dotted line) gradually approaches that of k-means with all original data (solid line) as n increases. [...] Finally, we present a real data example using the Wine dataset from the UCI Machine Learning Repository [...]. The average misclassification rates are summarized in Table 2.
Researcher Affiliation | Academia | Yoshikazu Terada (Graduate School of Engineering Science, The University of Osaka; Center for Advanced Integrated Intelligence Research, RIKEN). Xin Guan (Graduate School of Information Sciences, Tohoku University).
Pseudocode | No | The paper describes the k-POD clustering algorithm and its loss function, and mentions the majorization-minimization algorithm, but it does not present any structured pseudocode or algorithm blocks.
Open Source Code | No | The paper mentions that "Chi et al. (2016) provides the R package kpodclustr including the implementation of the k-POD clustering with a single specific initialization." This refers to code provided by a *referenced* work (Chi et al., 2016), not code released by the authors of *this* paper for their own methodology or experiments. No specific link or statement of code release from the current authors is provided.
Open Datasets | Yes | Finally, we present a real data example using the Wine dataset from the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml).
Dataset Splits | Yes | We randomly select 50 samples as a validation set and use the remaining 128 samples for training. [...] we randomly introduce missingness into the training data at fixed rates (10% and 30%).
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running its experiments or simulations.
Software Dependencies | No | The paper mentions "R package kpodclustr" in the context of a previous work, but it does not specify any software dependencies with version numbers for the experiments conducted in this paper.
Experiment Setup | Yes | More precisely, to initialize the cluster centers in the presence of missing values, we proceed as follows. We first compute the column-wise means of the observed entries and use these to impute the missing values in the data matrix. Each missing entry is replaced with the corresponding column mean. We then randomly select k data points from the imputed data to serve as the initial cluster centers. If any of the selected points are duplicated (i.e., some centers are identical due to imputation), we add small random noise to each entry of the initial centers to ensure diversity and numerical stability. To mitigate the effect of local minima, we perform 1000 random initializations and retain the solution with the lowest loss. [...] The missing rate for each variable, denoted by q, is set uniformly across variables and takes values in {10%, 30%, 50%}. [...] In the following, we use data standardized to have zero mean and unit variance for each variable.
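
The initialization procedure quoted under Experiment Setup can be sketched in Python with NumPy. This is a minimal illustration under stated assumptions, not the authors' code and not the `kpodclustr` implementation: the function names (`mean_impute`, `init_centers`, `fit_once`, `fit_best`), the noise scale `1e-6`, the iteration cap, and the Lloyd-style update on observed entries are all hypothetical choices made here. The loss is assumed to be the sum of squared differences between each point and its nearest center, taken over observed entries only. Missing values are encoded as `NaN`, and every column is assumed to have at least one observed entry.

```python
import numpy as np


def mean_impute(X):
    """Replace each NaN with the mean of the observed entries in its column."""
    col_means = np.nanmean(X, axis=0)
    return np.where(np.isnan(X), col_means, X)


def init_centers(X, k, rng, noise_scale=1e-6):
    """Pick k rows of the mean-imputed data as initial centers.

    If any selected rows coincide, add small Gaussian noise to every entry
    of the centers (noise_scale is an illustrative assumption).
    """
    X_imp = mean_impute(X)
    idx = rng.choice(X_imp.shape[0], size=k, replace=False)
    centers = X_imp[idx].copy()
    if np.unique(centers, axis=0).shape[0] < k:
        centers += noise_scale * rng.standard_normal(centers.shape)
    return centers


def observed_sq_dists(X0, obs, centers):
    """Squared distances from points to centers over observed entries only."""
    diff = (X0[:, None, :] - centers[None, :, :]) * obs[:, None, :]
    return (diff ** 2).sum(axis=2)


def fit_once(X, k, rng, n_iter=50):
    """One run from a random start (assumed Lloyd-style update, see lead-in)."""
    obs = ~np.isnan(X)
    X0 = np.where(obs, X, 0.0)  # zeros at missing positions are masked out
    centers = init_centers(X, k, rng)
    for _ in range(n_iter):
        labels = observed_sq_dists(X0, obs, centers).argmin(axis=1)
        for j in range(k):
            members = labels == j
            counts = obs[members].sum(axis=0)
            sums = X0[members].sum(axis=0)
            upd = counts > 0  # leave coordinates with no observed data as-is
            centers[j, upd] = sums[upd] / counts[upd]
    d2 = observed_sq_dists(X0, obs, centers)
    return centers, d2.argmin(axis=1), d2.min(axis=1).sum()


def fit_best(X, k, n_init=1000, seed=0):
    """Repeat fit_once n_init times and keep the run with the lowest loss."""
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(n_init):
        result = fit_once(X, k, rng)
        if best is None or result[2] < best[2]:
            best = result
    return best
```

A call such as `centers, labels, loss = fit_best(X, k=3, n_init=1000)` mirrors the quoted setting of 1000 random initializations. Standardizing each variable to zero mean and unit variance, as the setup describes, would be done on `X` before this call.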