Improved seeding strategies for k-means and k-GMM
Authors: Guillaume Carrière, Frédéric Cazals
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments show a significant improvement over classical contenders. In particular, for k-means, our methods improve on the recently designed multi-swap strategy (similar results in terms of SSE, seeding 6× faster), which was the first one to outperform the greedy k-means++ seeding. Our experimental analysis also sheds light on subtle properties of k-means often overlooked, including the (lack of) correlation between the SSE upon seeding and the final SSE, the variance reduction phenomena observed in iterative seeding methods, and the sensitivity of the final SSE to the pool size for greedy methods. |
| Researcher Affiliation | Academia | Guillaume Carrière EMAIL Centre Inria at Université Côte d'Azur, France; Frédéric Cazals EMAIL Centre Inria at Université Côte d'Azur, France |
| Pseudocode | Yes | Algorithm 1 Seeding-EON-EON: 1: procedure Seeding-EON-EON(data, K); 2: centers ← Seeding-EON(data, K); 3: reselect centers in reverse order; 4: for k = K down to 1 do; 5: delete centers[k]; 6: choose new_ck with the D² strategy; 7: insert new_ck in centers at position k. ... Algorithm 2: the classical Means2GMM algorithm. The variant Means2SphGMM consists of changing the full anisotropic estimation of line 7 to the isotropic estimation of line 8 (Blömer & Bujna, 2016). |
| Open Source Code | Yes | A Python implementation of the best contender, Seeding-EGD-EGC, is available from https://gitlab.inria.fr/abs/improved-seeding-kmeans-kgmm. An optimized C++ implementation is also available in the package SBL::Cluster_engines of the Structural Bioinformatics Library, https://sbl.inria.fr/doc/Cluster_engines-user-manual.html. |
| Open Datasets | Yes | Our experiments involve 12 datasets from the UCI Machine Learning Repository, 11 of which are a subset of the 32 used in (Celebi et al., 2013): {Cloud cover, Corel image features, Steel plates faults, Letter recognition, Multiple features, Musk (Clean2), Optical digits, Pen digits, Image segmentation, Shuttle (Statlog), Spambase, Yeast}. ... We also process the {KDD-BIO, KDD-PHY, RNA} datasets from (Lattanzi & Sohler, 2019) used in the assessment of the SOTA method k-means++-MS-G (Beretta et al., 2023). ... Following (Blömer & Bujna, 2013), we use GMMs defined from the following parameters: (NB: code available from https://github.com/mdqyy/simple-gmm-initializations.) |
| Dataset Splits | No | The paper does not explicitly provide training/test/validation dataset splits needed to reproduce the experiment. For the unsupervised clustering algorithms (k-means and k-GMM), the entire datasets are typically used for clustering, and evaluation is conducted over multiple runs or on the full dataset without a traditional train/test split. |
| Hardware Specification | Yes | Calculations were run on an HP desktop computer running Fedora Core 39, equipped with 24 CPUs (i9-13900K) and 131 GB of RAM. |
| Software Dependencies | No | All methods were implemented in C++ using the Eigen library. The paper does not specify version numbers for C++ or the Eigen library, nor for Python which is implied for one of the implementations mentioned. Therefore, specific ancillary software details with version numbers are not provided. |
| Experiment Setup | Yes | Implementation and stop criterion. We compare our seeding methods with k-means++-LS (with Z = K) and k-means++-MS-G (with Z = K, p = 2 + log(K)) to initialize k-means. We chose Z = K to match the number of swaps performed by our methods, and p = 2 + log(K) to match the candidate pool size originally proposed for greedy k-means++ in (Arthur & Vassilvitskii, 2007). ... The Lloyd iterations are stopped when the Frobenius norm of the difference in the cluster centers is smaller than 1e-4. As a failsafe, the Lloyd iterations are also stopped after reaching a maximum number of iterations of 50. ... For a given dataset, final log-likelihood values are obtained by averaging the results of R = 30 runs of each initialization method in order to assess the variance inherent to their randomness. |
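The reverse-reselection loop quoted from Algorithm 1 above can be sketched in Python. This is a hedged illustration only, not the authors' implementation: it assumes a standard D² (k-means++-style) sampling step, and the function names `d2_sample` and `seeding_reselect` are hypothetical.

```python
import numpy as np

def d2_sample(data, centers, rng):
    """Draw one data point with probability proportional to its squared
    distance to the nearest current center (the D^2 strategy)."""
    c = np.asarray(centers)
    # squared distance of each point to its nearest center, shape (n,)
    d2 = np.min(((data[:, None, :] - c[None, :, :]) ** 2).sum(axis=-1), axis=1)
    idx = rng.choice(len(data), p=d2 / d2.sum())
    return data[idx]

def seeding_reselect(data, centers, rng):
    """Sketch of the reverse-reselection loop of Algorithm 1: for k = K
    down to 1, delete centers[k], draw a replacement with the D^2 strategy
    against the remaining centers, and reinsert it at position k.
    Assumes K >= 2 so the center list is never empty after a deletion."""
    centers = [np.asarray(c) for c in centers]
    for k in range(len(centers) - 1, -1, -1):
        del centers[k]
        new_ck = d2_sample(data, centers, rng)
        centers.insert(k, new_ck)
    return np.array(centers)
```

Because every position is reselected, each final center is an actual data point; the initial centers (here assumed to come from a prior seeding pass such as Seeding-EON) only shape the sampling distribution.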