Modified K-means Algorithm with Local Optimality Guarantees
Authors: Mingyi Li, Michael R. Metel, Akiko Takeda
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Numerical experiments confirm that the K-means algorithm does not always find a locally optimal solution in practice. Through extensive experiments on both synthetic and real-world datasets using different Bregman divergences, we find that our modified K-means algorithms (LO-K-means) yield improved locally optimal solutions with reduced clustering loss. |
| Researcher Affiliation | Collaboration | 1Department of Mathematical Informatics, The University of Tokyo, Tokyo, Japan 2Huawei Noah's Ark Lab, Montréal, Canada 3Center for Advanced Intelligence Project, RIKEN, Tokyo, Japan. |
| Pseudocode | Yes | Function 1 (guarantees convergence to a C-local solution); Function 2 (guarantees convergence to a D-local solution); Algorithm 1 (Locally-Optimal K-means Algorithm, LO-K-means); Algorithm 2 (K-means Algorithm); Function 3 (variant of Function 2 minimizing (1)) |
| Open Source Code | Yes | Our code is available at https://github.com/lmingyi/LO-K-means. |
| Open Datasets | Yes | We conducted experiments using both synthetic and real-world datasets to validate the performance of the LO-K-means algorithm (Algorithm 1). ... Iris (Fisher, 1936): This dataset consists of 150 instances and 4 features, where each instance represents a plant. Wine Quality (Cortez et al., 2009): A dataset with 6,497 instances and 11 features, where each instance corresponds to a wine sample with quality ratings. Yeast (Nakai & Kanehisa, 1991; 1992): This dataset contains 1,484 instances and 8 features, where each instance represents a protein sample with attributes related to its cellular localization. Predict Students' Dropout and Academic Success (Martins et al., 2021): This dataset consists of 4,424 instances and 36 features related to students' academic performance and dropout risk. News20 (scikit-learn developers, 2017): This dataset contains 11,314 instances and 131,017 features, representing word frequencies in news articles. |
| Dataset Splits | No | The synthetic datasets were generated as follows. We uniformly sampled N data points from the space [1, 10]^d, restricted to integer values. If the same point is selected multiple times, the number of selections is assigned as its weight. The evaluation was conducted for cluster sizes of K = 5, 10, 25, and 50. For the News20 dataset, experiments were conducted in two cases: 1. Using the first 2,000 instances, focusing on the 1,089 features with word frequencies between 2% and 80%. 2. Using the first 200 instances and 131,017 features. The paper does not provide specific training/test/validation dataset splits; it describes dataset generation and subsampling for experiments. |
| Hardware Specification | No | The paper discusses computational complexity but does not provide specific hardware details (exact GPU/CPU models, processor types with speeds, memory amounts, or detailed computer specifications) used for running its experiments. |
| Software Dependencies | No | The paper mentions `scikit-learn` and `NumPy` but does not provide specific version numbers for these or any other key software dependencies used for implementation or experiments. |
| Experiment Setup | Yes | For the initialization of cluster centers in the above algorithms, we use two methods: uniform random sampling of K points without replacement and the K-means++ method (Arthur & Vassilvitskii, 2006), as implemented in scikit-learn. The evaluation was conducted for cluster sizes of K = 5, 10, 25, and 50. We observe that limiting the number of iterations to 200-300 leads to substantial improvements in computational efficiency... We also note that this analysis is in line with the default iteration limit of 300 for the K-means algorithm used in scikit-learn. |
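The synthetic-data generation and baseline setup described in the table (integer points sampled uniformly from [1, 10]^d with duplicate counts used as sample weights, K-means++ initialization, and a 300-iteration cap) can be sketched as follows. This is an illustrative reconstruction, not the paper's code: the values of N, d, K, and the random seed are assumptions, and the paper's own LO-K-means implementation is at the repository linked above.

```python
# Sketch of the synthetic-data setup and baseline K-means run described
# in the reproducibility table. N, d, K, and seeds are illustrative
# assumptions, not the paper's exact configuration.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
N, d, K = 1000, 3, 5

# Uniformly sample N integer points from [1, 10]^d; duplicated points
# are collapsed and their multiplicity becomes a sample weight.
points = rng.integers(1, 11, size=(N, d))
unique_pts, weights = np.unique(points, axis=0, return_counts=True)

# Baseline comparison point: scikit-learn K-means with k-means++
# initialization and the default 300-iteration limit noted in the setup.
km = KMeans(n_clusters=K, init="k-means++", max_iter=300,
            n_init=1, random_state=0)
km.fit(unique_pts, sample_weight=weights)
print(km.cluster_centers_.shape)  # one center per cluster, d coordinates each
```

The weighted formulation matters here: collapsing duplicates and passing `sample_weight` gives the same objective as clustering the raw N points, while shrinking the effective dataset size.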