A note on the $k$-means clustering for missing data
Authors: Yoshikazu Terada, Xin Guan
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we illustrate some numerical simulations to verify the inconsistency of k-POD. [...] Figure 3 shows that the MSE of k-means with complete cases (dotted line) gradually approaches that of k-means with all original data (solid line) as n increases. [...] Finally, we present a real data example using the Wine dataset from the UCI Machine Learning Repository [...]. The average misclassification rates are summarized in Table 2. |
| Researcher Affiliation | Academia | Yoshikazu Terada (Graduate School of Engineering Science, The University of Osaka; Center for Advanced Integrated Intelligence Research, RIKEN); Xin Guan (Graduate School of Information Sciences, Tohoku University) |
| Pseudocode | No | The paper describes the k-POD clustering algorithm and its loss function, and mentions the majorization-minimization algorithm, but it does not present any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper mentions that "Chi et al. (2016) provides the R package kpodclustr including the implementation of the k-POD clustering with a single specific initialization." This refers to code provided by a *referenced* work (Chi et al., 2016), not code released by the authors of *this* paper for their own methodology or experiments. No specific link or statement of code release from the current authors is provided. |
| Open Datasets | Yes | Finally, we present a real data example using the Wine dataset from the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml). |
| Dataset Splits | Yes | We randomly select 50 samples as a validation set and use the remaining 128 samples for training. [...] we randomly introduce missingness into the training data at fixed rates (10% and 30%). |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running its experiments or simulations. |
| Software Dependencies | No | The paper mentions "R package kpodclustr" in the context of a previous work, but it does not specify any software dependencies with version numbers for the experiments conducted in this paper. |
| Experiment Setup | Yes | More precisely, to initialize the cluster centers in the presence of missing values, we proceed as follows. We first compute the column-wise means of the observed entries and use these to impute the missing values in the data matrix. Each missing entry is replaced with the corresponding column mean. We then randomly select k data points from the imputed data to serve as the initial cluster centers. If any of the selected points are duplicated (i.e., some centers are identical due to imputation), we add small random noise to each entry of the initial centers to ensure diversity and numerical stability. To mitigate the effect of local minima, we perform 1000 random initializations and retain the solution with the lowest loss. [...] The missing rate for each variable, denoted by q, is set uniformly across variables and takes values in {10%, 30%, 50%}. [...] In the following, we use data standardized to have zero mean and unit variance for each variable. |
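The k-POD loss and majorization-minimization (MM) scheme mentioned in the Pseudocode row can be sketched as follows. This is an illustrative reimplementation based on the general k-POD idea from Chi et al. (2016), not the authors' code or the `kpodclustr` package; the iteration count, seeding, and NumPy usage are assumptions. The objective measures squared error over observed entries only, and each MM step fills missing entries with the current assigned center's values before a standard k-means update.

```python
import numpy as np

def kpod_loss(X, centers, labels):
    """k-POD objective: squared error summed over *observed* entries only."""
    obs = ~np.isnan(X)
    diff = np.where(obs, X - centers[labels], 0.0)
    return (diff ** 2).sum()

def kpod(X, k, n_iter=50, seed=0):
    """MM sketch of k-POD: impute missing entries with the current
    centers, then take one k-means step on the completed data."""
    rng = np.random.default_rng(seed)
    obs = ~np.isnan(X)
    # Start from a column-mean completion of the data matrix.
    X_fill = np.where(obs, X, np.nanmean(X, axis=0))
    centers = X_fill[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each point to its nearest center on the completed data.
        d = ((X_fill[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        # Majorization: fill each missing entry with its center's value.
        X_fill = np.where(obs, X, centers[labels])
        # Minimization: recompute centers on the completed data.
        centers = np.array([X_fill[labels == j].mean(0) if (labels == j).any()
                            else centers[j] for j in range(k)])
    return labels, centers, kpod_loss(X, centers, labels)
```

Because missing entries are imputed from the current fit rather than dropped, k-POD always operates on a complete matrix; the paper's point is that this scheme can be inconsistent under certain missingness mechanisms.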
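The initialization and multi-restart procedure quoted in the Experiment Setup row can be sketched as below. This is a minimal reimplementation of the described steps, not the authors' code; the jitter scale `1e-6`, the convergence tolerance, and the NumPy API choices are assumptions.

```python
import numpy as np

def init_centers(X, k, rng, jitter=1e-6):
    """Column-mean imputation, then k random points as initial centers;
    duplicated centers are perturbed with small noise (sketch)."""
    X_imp = X.copy()
    nan_mask = np.isnan(X_imp)
    col_means = np.nanmean(X, axis=0)
    X_imp[nan_mask] = np.take(col_means, np.where(nan_mask)[1])
    centers = X_imp[rng.choice(len(X_imp), size=k, replace=False)].copy()
    if len(np.unique(centers, axis=0)) < k:
        centers += rng.normal(scale=jitter, size=centers.shape)
    return X_imp, centers

def kmeans_once(X_imp, centers, n_iter=100):
    """Plain Lloyd's algorithm on the imputed data (sketch)."""
    for _ in range(n_iter):
        d = ((X_imp[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        new = np.array([X_imp[labels == j].mean(0) if (labels == j).any()
                        else centers[j] for j in range(len(centers))])
        if np.allclose(new, centers):
            break
        centers = new
    loss = ((X_imp - centers[labels]) ** 2).sum()
    return labels, centers, loss

def best_of_restarts(X, k, n_restarts=1000, seed=0):
    """Run many random initializations and keep the lowest-loss fit."""
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(n_restarts):
        X_imp, c0 = init_centers(X, k, rng)
        fit = kmeans_once(X_imp, c0)
        if best is None or fit[2] < best[2]:
            best = fit
    return best
```

As in the quoted setup, the 1000 restarts mitigate local minima; in practice, far fewer restarts often suffice for small, well-separated data.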