reproducibilityindex.ai

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

A note on the $k$-means clustering for missing data

Authors: Yoshikazu Terada, Xin Guan

TMLR 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	In this section, we illustrate some numerical simulations to verify the inconsistency of k-POD. [...] Figure 3 shows that the MSE of k-means with complete cases (dotted line) gradually approaches that of k-means with all original data (solid line) as n increases. [...] Finally, we present a real data example using the Wine dataset from the UCI Machine Learning Repository [...]. The average misclassification rates are summarized in Table 2.
Researcher Affiliation	Academia	Yoshikazu Terada: Graduate School of Engineering Science, The University of Osaka; Center for Advanced Integrated Intelligence Research, RIKEN. Xin Guan: Graduate School of Information Sciences, Tohoku University.
Pseudocode	No	The paper describes the k-POD clustering algorithm and its loss function, and mentions the majorization-minimization algorithm, but it does not present any structured pseudocode or algorithm blocks.
Open Source Code	No	The paper mentions that "Chi et al. (2016) provides the R package kpodclustr including the implementation of the k-POD clustering with a single specific initialization." This refers to code provided by a referenced work (Chi et al., 2016), not code released by the authors of this paper for their own methodology or experiments. No specific link or statement of code release from the current authors is provided.
Open Datasets	Yes	Finally, we present a real data example using the Wine dataset from the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml).
Dataset Splits	Yes	We randomly select 50 samples as a validation set and use the remaining 128 samples for training. [...] we randomly introduce missingness into the training data at fixed rates (10% and 30%).
Hardware Specification	No	The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running its experiments or simulations.
Software Dependencies	No	The paper mentions "R package kpodclustr" in the context of a previous work, but it does not specify any software dependencies with version numbers for the experiments conducted in this paper.
Experiment Setup	Yes	More precisely, to initialize the cluster centers in the presence of missing values, we proceed as follows. We first compute the column-wise means of the observed entries and use these to impute the missing values in the data matrix. Each missing entry is replaced with the corresponding column mean. We then randomly select k data points from the imputed data to serve as the initial cluster centers. If any of the selected points are duplicated (i.e., some centers are identical due to imputation), we add small random noise to each entry of the initial centers to ensure diversity and numerical stability. To mitigate the effect of local minima, we perform 1000 random initializations and retain the solution with the lowest loss. [...] The missing rate for each variable, denoted by q, is set uniformly across variables and takes values in {10%, 30%, 50%}. [...] In the following, we use data standardized to have zero mean and unit variance for each variable.
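
The initialization and data-preparation steps quoted in the Dataset Splits and Experiment Setup rows can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function names `introduce_missingness` and `init_centers`, the `noise_scale` value, the choice k = 3, the synthetic stand-in data, and the assumption that each entry is masked independently with probability q are all hypothetical and not stated in the source.

```python
import numpy as np


def introduce_missingness(X, rate, rng):
    """Return a copy of X with entries set to NaN.

    Assumption: each entry is masked independently with probability `rate`;
    the source only says missingness is introduced randomly at a fixed rate.
    """
    X_miss = X.astype(float).copy()
    mask = rng.random(X.shape) < rate
    X_miss[mask] = np.nan
    return X_miss


def init_centers(X_miss, k, rng, noise_scale=1e-6):
    """Pick k initial centers from column-mean-imputed data.

    Follows the quoted description: impute each missing entry with the mean
    of the observed entries in its column, draw k rows at random, and add
    small random noise if any of the drawn rows coincide.
    `noise_scale` is a hypothetical value, not taken from the paper.
    """
    col_means = np.nanmean(X_miss, axis=0)
    X_imp = np.where(np.isnan(X_miss), col_means, X_miss)
    idx = rng.choice(X_imp.shape[0], size=k, replace=False)
    centers = X_imp[idx].copy()
    # Duplicated centers can arise when several rows become identical after imputation.
    if np.unique(centers, axis=0).shape[0] < k:
        centers += noise_scale * rng.standard_normal(centers.shape)
    return centers


rng = np.random.default_rng(0)

# Synthetic stand-in with the Wine dataset's shape (178 samples, 13 variables).
X = rng.standard_normal((178, 13))

# 50 validation samples, remaining 128 for training, as in the quoted split.
perm = rng.permutation(X.shape[0])
val_idx, train_idx = perm[:50], perm[50:]
X_train = X[train_idx]

# Missingness is introduced into the training data only (here at a 30% rate).
X_miss = introduce_missingness(X_train, rate=0.3, rng=rng)
centers = init_centers(X_miss, k=3, rng=rng)
```

In the setup described above, this initialization would be repeated (1000 times in the paper) with the lowest-loss solution retained; the clustering fit itself is not shown here.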