Systematically and efficiently improving $k$-means initialization by pairwise-nearest-neighbor smoothing

Authors: Carlo Baldassi

TMLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show empirically, using several existing seeding methods and testing on several synthetic and real datasets, that this procedure results in systematically better costs. In particular, our method of enhancing k-means++ seeding proves superior in both effectiveness and speed compared to the popular greedy k-means++ variant. Our implementation is publicly available at https://github.com/carlobaldassi/KMeansPNNSmoothing.jl. [...] In sec. 4 we present and analyze detailed numerical results on several challenging synthetic and non-synthetic datasets. [...] We present our most representative results in table 3; the complete results are reported in Appendix A.1. All algorithms except for pnn and those in the pnns family fail badly in at least some dataset.
Researcher Affiliation | Academia | Carlo Baldassi EMAIL, Department of Computing Sciences, Art Lab, BIDSA, Bocconi University, Milan; ELLIS Scholar
Pseudocode | Yes | Algorithm 1: PNNS seeding
Input: X, k, init, ρ
1. Set J = ⌈ρN/(2k)⌉, capping the result at N/k.
2. Split X into J random subsets Xa.
3. Cluster each Xa independently, using init for seeding followed by Lloyd's algorithm; obtain J configurations (Ca, Pa), with k clusters each.
4. Collect the J configurations (Ca, Pa) into a single configuration (C0, P0) for the entire X, with kJ centroids and clusters.
5. Merge the clusters of (C0, P0), two at a time, using the PNN procedure, until k clusters remain; obtain a set of k centroids C.
6. Compute the optimal partition P associated to C.
Output: (C, P)
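The steps of Algorithm 1 can be sketched in Python (the paper's implementation is in Julia). The helper names `lloyd`, `pnn_merge`, and `pnns_seed` are illustrative, the PNN merge here uses the standard size-weighted merge cost with unit initial weights, and no acceleration techniques are applied; this is a minimal sketch of the scheme, not the authors' code.

```python
import numpy as np

def lloyd(X, C, iters=100):
    """Plain Lloyd's algorithm: alternate assignment and centroid updates."""
    for _ in range(iters):
        lab = ((X[:, None] - C[None]) ** 2).sum(-1).argmin(1)
        newC = np.array([X[lab == j].mean(0) if (lab == j).any() else C[j]
                         for j in range(len(C))])
        if np.allclose(newC, C):
            break
        C = newC
    return C

def pnn_merge(C, k):
    """Greedily merge the pairwise-nearest centroids down to k,
    using the size-weighted cost w_i*w_j/(w_i+w_j)*||c_i - c_j||^2."""
    C, w = [c for c in C], [1.0] * len(C)
    while len(C) > k:
        best, bi, bj = np.inf, 0, 1
        for i in range(len(C)):
            for j in range(i + 1, len(C)):
                cost = w[i] * w[j] / (w[i] + w[j]) * ((C[i] - C[j]) ** 2).sum()
                if cost < best:
                    best, bi, bj = cost, i, j
        C[bi] = (w[bi] * C[bi] + w[bj] * C[bj]) / (w[bi] + w[bj])
        w[bi] += w[bj]
        del C[bj]; del w[bj]
    return np.array(C)

def pnns_seed(X, k, init, rho=1.0, seed=0):
    """PNNS seeding: init(Xs, k, rng) is any seeding routine returning
    k centroids for a subset Xs."""
    rng = np.random.default_rng(seed)
    N = len(X)
    J = max(1, min(int(np.ceil(rho * N / (2 * k))), N // k))  # step 1
    subsets = np.array_split(rng.permutation(N), J)           # step 2
    C0 = np.vstack([lloyd(X[s], init(X[s], k, rng))           # steps 3-4
                    for s in subsets])
    C = pnn_merge(C0, k)                                      # step 5
    labels = ((X[:, None] - C[None]) ** 2).sum(-1).argmin(1)  # step 6
    return C, labels
```

Any seeding routine can be plugged in as `init` (e.g. uniform sampling or k-means++), which mirrors how the paper treats PNNS as a meta-method wrapping an existing seeder.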
Open Source Code | Yes | Our implementation is publicly available at https://github.com/carlobaldassi/KMeansPNNSmoothing.jl.
Open Datasets | Yes | We tested a number of synthetic and real-world datasets whose characteristics are shown in table 2. The synthetic datasets are mainly intended to measure the ability of the algorithms to find the solution when one can be clearly identified, and for all of them the global optimum of the SSE is very close to the ground truth. Under these circumstances, it is reasonable to classify the local minima configurations that the algorithms produce by their centroid index (CI), as defined in Fränti et al. (2014). The real-world datasets on the other hand do not generally have a simple structure or a simple solution, and their optimum SSE is unknown. Our tests in sec. 4.3 will show that the pnns scheme can provide systematic improvements in the SSE with a bounded increase in computing time, generally offering a better trade-off than alternative methods. We picked 5 of the most challenging ones, all obtained from the UEF repository (Fränti & Sieranoja, 2018): A3, Birch1, Birch2, Unbalance and Dim1024 (see table 2). We tested 7 challenging real-world datasets (see table 2). The first three, Bridge, House and Miss America, were obtained from the UEF repository (Fränti & Sieranoja, 2018); Urban GB, Isolet and USCensus are from the UCI repository (Dua & Graff, 2017) and Olivetti from scikit-learn (Pedregosa et al., 2011).
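For reference, the centroid index compares two sets of centroids at the cluster level rather than point by point. The sketch below follows the definition in Fränti et al. (2014): map each centroid in one set to its nearest centroid in the other, count the centroids that receive no mapping ("orphans"), and symmetrize by taking the maximum over both directions. The function name is our paraphrase, not the authors' API.

```python
import numpy as np

def centroid_index(C1, C2):
    """Centroid index (CI): CI = 0 means the two solutions share the
    same cluster-level structure; each extra unit is one misplaced
    centroid."""
    def orphans(A, B):
        # nearest centroid of B for each row of A
        near = ((A[:, None] - B[None]) ** 2).sum(-1).argmin(1)
        return len(B) - len(set(near.tolist()))  # B centroids never hit
    return max(orphans(C1, C2), orphans(C2, C1))
```

On the synthetic datasets the paper uses CI = 0 against the ground-truth centroids as the success criterion, which is why it is a natural metric when the global optimum is essentially known.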
Dataset Splits | No | We split the data as evenly as possible, i.e. we create N mod J subsets of size ⌊N/J⌋ + 1 and J − (N mod J) subsets of size ⌊N/J⌋. This is easy to implement efficiently by just constructing a sorted list of indices, each index being repeated for the appropriate number of times, and shuffling it. Our preliminary testing showed that the algorithm is not sensitive to the details of the splitting procedure. We performed 100 tests for each dataset (30 for Urban GB and USCensus) and computed the average and minimum SSE cost achieved across the runs (the latter metric can provide an indication about what can be achieved with a multiple-restarts strategy by each algorithm), as well as the average running time, NDCs, and Lloyd's iterations.
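The splitting procedure described above can be written in a few lines. This sketch builds the sorted list of repeated subset indices and shuffles it, exactly as the text describes; the function name is illustrative, not the paper's API.

```python
import numpy as np

def split_assignments(N, J, seed=None):
    """Assign N points to J near-equal subsets: N mod J subsets of size
    floor(N/J) + 1, and J - (N mod J) subsets of size floor(N/J)."""
    rng = np.random.default_rng(seed)
    base, extra = divmod(N, J)
    sizes = [base + 1] * extra + [base] * (J - extra)
    a = np.repeat(np.arange(J), sizes)  # sorted index list, right multiplicities
    rng.shuffle(a)                      # randomize assignment, sizes preserved
    return a                            # a[i] = subset id of point i
```

Shuffling a fixed multiset of subset ids is O(N) and guarantees the subset sizes by construction, which is why it is "easy to implement efficiently".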
Hardware Specification | Yes | All the timings that we report refer to tests performed on the same hardware (Intel Core i7-9750H 2.60 GHz CPU with 6 physical cores, 64 GB DDR4 2666 MHz RAM, running Ubuntu Linux 20.04 with 5.15.0 kernel).
Software Dependencies | Yes | We used the same data structures and programming language (Julia v1.7.3) for all of them. For the local optimization part, i.e. Lloyd's algorithm, we implemented a number of techniques that can accelerate the computation, while keeping it exact, by skipping some updates at the cost of some bookkeeping: the reduced computation method (rc) by Kaukoranta et al. (1999); the Elkan method (elk) from Elkan (2003); the Hamerly method (ham) from Hamerly (2010); the Yinyang method (yy) from Ding et al. (2015); the exponion method (exp) from Newling & Fleuret (2016); the ball-kmeans method (ball) from Xia et al. (2022). For elk and yy we implemented the simplified versions described in Newling & Fleuret (2016). For each of the datasets that we tested we chose the accelerator technique that resulted in the fastest average convergence time, when using km++ for seeding, across at least 30 random repetitions, and used that accelerator for all other tests with that dataset.
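The accelerators listed above differ in bookkeeping, but the Elkan/Hamerly/exponion family shares one core idea: triangle-inequality pruning that skips distance computations while keeping the assignment exact. The single assignment pass below illustrates that shared idea in simplified form; it is not the paper's Julia implementation of any specific method.

```python
import numpy as np

def assign_with_bound(X, C):
    """Exact nearest-centroid assignment with triangle-inequality pruning:
    if d(x, c_best) <= 0.5 * d(c_best, c_j), then d(x, c_j) >= d(x, c_best)
    and the distance to c_j need not be computed."""
    cc = np.sqrt(((C[:, None] - C[None]) ** 2).sum(-1))  # centroid-centroid
    labels = np.empty(len(X), dtype=int)
    skipped = 0
    for i, x in enumerate(X):
        best = 0
        dbest = np.sqrt(((x - C[0]) ** 2).sum())
        for j in range(1, len(C)):
            if dbest <= 0.5 * cc[best, j]:  # prune: c_j cannot be closer
                skipped += 1
                continue
            d = np.sqrt(((x - C[j]) ** 2).sum())
            if d < dbest:
                best, dbest = j, d
        labels[i] = best
    return labels, skipped
```

The full methods add per-point upper and lower bounds maintained across Lloyd iterations, which is where the "bookkeeping" cost the text mentions comes from.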
Experiment Setup | Yes | The inputs of the procedure are the same as for any other algorithm (the dataset X and the number of clusters k), plus the subset-seeding algorithm init, and one extra parameter ρ, used to determine the number of subsets J. [...] For gkm++ we set s = 2 + log k. For ref(init) we normally set J = 10. For pnns(init) we used the value ρ = 1 throughout the main text. [...] In all our tests Lloyd's algorithm was run until convergence to a fixed point. Our code is available at https://github.com/carlobaldassi/KMeansPNNSmoothing.jl.