Modified K-means Algorithm with Local Optimality Guarantees
Authors: Mingyi Li, Michael R. Metel, Akiko Takeda
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Numerical experiments confirm that the K-means algorithm does not always find a locally optimal solution in practice. Through extensive experiments on both synthetic and real-world datasets using different Bregman divergences, we find that our modified K-means algorithms (LO-K-means) yield improved locally optimal solutions with reduced clustering loss. |
| Researcher Affiliation | Collaboration | 1Department of Mathematical Informatics, The University of Tokyo, Tokyo, Japan 2Huawei Noah's Ark Lab, Montréal, Canada 3Center for Advanced Intelligence Project, RIKEN, Tokyo, Japan. |
| Pseudocode | Yes | Function 1 (guarantees convergence to a C-local solution); Function 2 (guarantees convergence to a D-local solution); Algorithm 1 (Locally-Optimal K-means Algorithm, LO-K-means); Algorithm 2 (K-means Algorithm); Function 3 (variant of Function 2 minimizing (1)) |
| Open Source Code | Yes | Our code is available at https://github.com/lmingyi/LO-K-means. |
| Open Datasets | Yes | We conducted experiments using both synthetic and real-world datasets to validate the performance of the LO-K-means algorithm (Algorithm 1). ... Iris (Fisher, 1936): This dataset consists of 150 instances and 4 features, where each instance represents a plant. Wine Quality (Cortez et al., 2009): A dataset with 6,497 instances and 11 features, where each instance corresponds to a wine sample with quality ratings. Yeast (Nakai & Kanehisa, 1991; 1992): This dataset contains 1,484 instances and 8 features, where each instance represents a protein sample with attributes related to its cellular localization. Predict Students' Dropout and Academic Success (Martins et al., 2021): This dataset consists of 4,424 instances and 36 features related to students' academic performance and dropout risk. News20 (scikit-learn developers, 2017): This dataset contains 11,314 instances and 131,017 features, representing word frequencies in news articles. |
| Dataset Splits | No | The synthetic datasets were generated as follows. We uniformly sampled N data points from the space [1, 10]^d, restricted to integer values. If the same point is selected multiple times, the number of selections is assigned as its weight. The evaluation was conducted for cluster sizes of K = 5, 10, 25, and 50. For the News20 dataset, experiments were conducted in two cases: 1. Using the first 2,000 instances, focusing on the 1,089 features with word frequencies between 2% and 80%. 2. Using the first 200 instances and 131,017 features. The paper does not provide specific training/test/validation dataset splits; it describes dataset generation and subsampling for experiments. |
| Hardware Specification | No | The paper discusses computational complexity but does not provide specific hardware details (exact GPU/CPU models, processor types with speeds, memory amounts, or detailed computer specifications) used for running its experiments. |
| Software Dependencies | No | The paper mentions `scikit-learn` and `NumPy` but does not provide specific version numbers for these or any other key software dependencies used for implementation or experiments. |
| Experiment Setup | Yes | For the initialization of cluster centers in the above algorithms, we use two methods: uniform random sampling of K points without replacement and the K-means++ method (Arthur & Vassilvitskii, 2006), as implemented in scikit-learn. The evaluation was conducted for cluster sizes of K = 5, 10, 25, and 50. We observe that limiting the number of iterations to 200-300 leads to substantial improvements in computational efficiency... We also note that this analysis is in line with the default iteration limit of 300 for the K-means algorithm used in scikit-learn. |
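The synthetic-data generation and baseline setup described in the table (integer points sampled uniformly from [1, 10]^d with duplicate counts used as sample weights, K-means++ initialization, and a 300-iteration cap) can be sketched as follows. This is an illustrative reconstruction, not the paper's code: the values of N, d, K, and the random seed are assumptions, and the paper's own LO-K-means implementation is at the repository linked above.

```python
# Sketch of the synthetic-data setup and baseline K-means run described
# in the reproducibility table. N, d, K, and seeds are illustrative
# assumptions, not the paper's exact configuration.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
N, d, K = 1000, 3, 5

# Uniformly sample N integer points from [1, 10]^d; duplicated points
# are collapsed and their multiplicity becomes a sample weight.
points = rng.integers(1, 11, size=(N, d))
unique_pts, weights = np.unique(points, axis=0, return_counts=True)

# Baseline comparison point: scikit-learn K-means with k-means++
# initialization and the default 300-iteration limit noted in the setup.
km = KMeans(n_clusters=K, init="k-means++", max_iter=300,
            n_init=1, random_state=0)
km.fit(unique_pts, sample_weight=weights)
print(km.cluster_centers_.shape)  # one center per cluster, d coordinates each
```

The weighted formulation matters here: collapsing duplicates and passing `sample_weight` gives the same objective as clustering the raw N points, while shrinking the effective dataset size.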