Improved seeding strategies for k-means and k-GMM
Authors: Guillaume Carrière, Frédéric Cazals
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments show a significant improvement over classical contenders. In particular, for k-means, our methods improve on the recently designed multi-swap strategy (similar results in terms of SSE, seeding 6× faster), which was the first one to outperform the greedy k-means++ seeding. Our experimental analysis also sheds light on subtle properties of k-means often overlooked, including the (lack of) correlation between the SSE upon seeding and the final SSE, the variance reduction phenomena observed in iterative seeding methods, and the sensitivity of the final SSE to the pool size for greedy methods. |
| Researcher Affiliation | Academia | Guillaume Carrière EMAIL Centre Inria at Université Côte d'Azur, France; Frédéric Cazals EMAIL Centre Inria at Université Côte d'Azur, France |
| Pseudocode | Yes | Algorithm 1 Seeding-EON-EON: 1: procedure Seeding-EON-EON(data, K); 2: centers ← Seeding-EON(data, K); 3: reselect centers in reverse order; 4: for k = K down to 1 do; 5: delete centers[k]; 6: choose new_ck with the D² strategy; 7: insert new_ck in centers at position k. ... Algorithm 2: the classical Means2GMM algorithm. The variant Means2SphGMM consists of changing the full anisotropic estimation of line 7 to the isotropic estimation of line 8 (Blömer & Bujna, 2016). |
| Open Source Code | Yes | A Python implementation of the best contender, Seeding-EGD-EGC, is available from https://gitlab.inria.fr/abs/improved-seeding-kmeans-kgmm. An optimized C++ implementation is also available in the package SBL::Cluster_engines of the Structural Bioinformatics Library, https://sbl.inria.fr/doc/Cluster_engines-user-manual.html. |
| Open Datasets | Yes | Our experiments involve 12 datasets from the UCI Machine Learning Repository, 11 of which are a subset of the 32 used in (Celebi et al., 2013): {Cloud cover, Corel image features, Steel plates faults, Letter recognition, Multiple features, Musk (Clean2), Optical digits, Pen digits, Image segmentation, Shuttle (Statlog), Spambase, Yeast}. ... We also process the {KDD-BIO, KDD-PHY, RNA} datasets from (Lattanzi & Sohler, 2019) used in the assessment of the SOTA method k-means++-MS-G (Beretta et al., 2023). ... Following (Blömer & Bujna, 2013), we use GMMs defined from the following parameters: (NB: code available from https://github.com/mdqyy/simple-gmm-initializations.) |
| Dataset Splits | No | The paper does not explicitly provide training/test/validation dataset splits needed to reproduce the experiment. For the unsupervised clustering algorithms (k-means and k-GMM), the entire datasets are typically used for clustering, and evaluation is conducted over multiple runs or on the full dataset without a traditional train/test split. |
| Hardware Specification | Yes | Calculations were run on an HP desktop computer running Fedora Core 39, equipped with 24 CPUs (i9-13900K) and 131 GB of RAM. |
| Software Dependencies | No | All methods were implemented in C++ using the Eigen library. The paper does not specify version numbers for C++ or the Eigen library, nor for Python which is implied for one of the implementations mentioned. Therefore, specific ancillary software details with version numbers are not provided. |
| Experiment Setup | Yes | Implementation and stop criterion. We compare our seeding methods with k-means++-LS (with Z = K) and k-means++-MS-G (with Z = K, p = 2 + log(K)) to initialize k-means. We chose Z = K to match the number of swaps performed by our methods, and p = 2 + log(K) to match the candidate pool size originally proposed for greedy k-means++ in (Arthur & Vassilvitskii, 2007). ... The Lloyd iterations are stopped when the Frobenius norm of the difference in the cluster centers is smaller than 1e-4. As a failsafe, the Lloyd iterations are also stopped after reaching a maximum number of iterations of 50. ... For a given dataset, final log-likelihood values are obtained by averaging the results of R = 30 runs of each initialization method in order to assess the variance inherent to their randomness. |
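The reverse-reselection loop quoted from Algorithm 1 above can be sketched in Python. This is a hedged illustration only, not the authors' implementation: it assumes a standard D² (k-means++-style) sampling step, and the function names `d2_sample` and `seeding_reselect` are hypothetical.

```python
import numpy as np

def d2_sample(data, centers, rng):
    """Draw one data point with probability proportional to its squared
    distance to the nearest current center (the D^2 strategy)."""
    c = np.asarray(centers)
    # squared distance of each point to its nearest center, shape (n,)
    d2 = np.min(((data[:, None, :] - c[None, :, :]) ** 2).sum(axis=-1), axis=1)
    idx = rng.choice(len(data), p=d2 / d2.sum())
    return data[idx]

def seeding_reselect(data, centers, rng):
    """Sketch of the reverse-reselection loop of Algorithm 1: for k = K
    down to 1, delete centers[k], draw a replacement with the D^2 strategy
    against the remaining centers, and reinsert it at position k.
    Assumes K >= 2 so the center list is never empty after a deletion."""
    centers = [np.asarray(c) for c in centers]
    for k in range(len(centers) - 1, -1, -1):
        del centers[k]
        new_ck = d2_sample(data, centers, rng)
        centers.insert(k, new_ck)
    return np.array(centers)
```

Because every position is reselected, each final center is an actual data point; the initial centers (here assumed to come from a prior seeding pass such as Seeding-EON) only shape the sampling distribution.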