Systematically and efficiently improving $k$-means initialization by pairwise-nearest-neighbor smoothing

Authors: Carlo Baldassi

TMLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show empirically, using several existing seeding methods and testing on several synthetic and real datasets, that this procedure results in systematically better costs. In particular, our method of enhancing k-means++ seeding proves superior in both effectiveness and speed compared to the popular greedy k-means++ variant. Our implementation is publicly available at https://github.com/carlobaldassi/KMeansPNNSmoothing.jl. [...] In sec. 4 we present and analyze detailed numerical results on several challenging synthetic and non-synthetic datasets. [...] We present our most representative results in table 3; the complete results are reported in Appendix A.1. All algorithms except for pnn and those in the pnns family fail badly in at least some dataset.
Researcher Affiliation | Academia | Carlo Baldassi EMAIL, Department of Computing Sciences, Art Lab, BIDSA, Bocconi University, Milan; ELLIS Scholar
Pseudocode | Yes | Algorithm 1: PNNS seeding
Input: X, k, init, ρ
1. Set J = ⌈ρN/(2k)⌉, capping the result at N/k.
2. Split X into J random subsets Xa.
3. Cluster each Xa independently, using init for seeding followed by Lloyd's algorithm; obtain J configurations (Ca, Pa), with k clusters each.
4. Collect the J configurations (Ca, Pa) into a single configuration (C0, P0) for the entire X, with kJ centroids and clusters.
5. Merge the clusters of (C0, P0), two at a time, using the PNN procedure, until k clusters remain; obtain a set of k centroids C.
6. Compute the optimal partition P associated to C.
Output: (C, P)
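The steps of Algorithm 1 can be sketched in Python (the paper's implementation is in Julia). The helper names `lloyd`, `pnn_merge`, and `pnns_seed` are illustrative, the PNN merge here uses the standard size-weighted merge cost with unit initial weights, and no acceleration techniques are applied; this is a minimal sketch of the scheme, not the authors' code.

```python
import numpy as np

def lloyd(X, C, iters=100):
    """Plain Lloyd's algorithm: alternate assignment and centroid updates."""
    for _ in range(iters):
        lab = ((X[:, None] - C[None]) ** 2).sum(-1).argmin(1)
        newC = np.array([X[lab == j].mean(0) if (lab == j).any() else C[j]
                         for j in range(len(C))])
        if np.allclose(newC, C):
            break
        C = newC
    return C

def pnn_merge(C, k):
    """Greedily merge the pairwise-nearest centroids down to k,
    using the size-weighted cost w_i*w_j/(w_i+w_j)*||c_i - c_j||^2."""
    C, w = [c for c in C], [1.0] * len(C)
    while len(C) > k:
        best, bi, bj = np.inf, 0, 1
        for i in range(len(C)):
            for j in range(i + 1, len(C)):
                cost = w[i] * w[j] / (w[i] + w[j]) * ((C[i] - C[j]) ** 2).sum()
                if cost < best:
                    best, bi, bj = cost, i, j
        C[bi] = (w[bi] * C[bi] + w[bj] * C[bj]) / (w[bi] + w[bj])
        w[bi] += w[bj]
        del C[bj]; del w[bj]
    return np.array(C)

def pnns_seed(X, k, init, rho=1.0, seed=0):
    """PNNS seeding: init(Xs, k, rng) is any seeding routine returning
    k centroids for a subset Xs."""
    rng = np.random.default_rng(seed)
    N = len(X)
    J = max(1, min(int(np.ceil(rho * N / (2 * k))), N // k))  # step 1
    subsets = np.array_split(rng.permutation(N), J)           # step 2
    C0 = np.vstack([lloyd(X[s], init(X[s], k, rng))           # steps 3-4
                    for s in subsets])
    C = pnn_merge(C0, k)                                      # step 5
    labels = ((X[:, None] - C[None]) ** 2).sum(-1).argmin(1)  # step 6
    return C, labels
```

Any seeding routine can be plugged in as `init` (e.g. uniform sampling or k-means++), which mirrors how the paper treats PNNS as a meta-method wrapping an existing seeder.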
Open Source Code | Yes | Our implementation is publicly available at https://github.com/carlobaldassi/KMeansPNNSmoothing.jl.
Open Datasets | Yes | We tested a number of synthetic and real-world datasets whose characteristics are shown in table 2. The synthetic datasets are mainly intended to measure the ability of the algorithms to find the solution when one can be clearly identified, and for all of them the global optimum of the SSE is very close to the ground truth. Under these circumstances, it is reasonable to classify the local minima configurations that the algorithms produce by their centroid index (CI), as defined in Fränti et al. (2014). The real-world datasets on the other hand do not generally have a simple structure or a simple solution, and their optimum SSE is unknown. Our tests in sec. 4.3 will show that the pnns scheme can provide systematic improvements in the SSE with a bounded increase in computing time, generally offering a better trade-off than alternative methods. We picked 5 of the most challenging ones, all obtained from the UEF repository (Fränti & Sieranoja, 2018): A3, Birch1, Birch2, Unbalance and Dim1024 (see table 2). We tested 7 challenging real-world datasets (see table 2). The first three, Bridge, House and Miss America, were obtained from the UEF repository (Fränti & Sieranoja, 2018); Urban GB, Isolet and USCensus are from the UCI repository (Dua & Graff, 2017) and Olivetti from scikit-learn (Pedregosa et al., 2011).
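For reference, the centroid index compares two sets of centroids at the cluster level rather than point by point. The sketch below follows the definition in Fränti et al. (2014): map each centroid in one set to its nearest centroid in the other, count the centroids that receive no mapping ("orphans"), and symmetrize by taking the maximum over both directions. The function name is our paraphrase, not the authors' API.

```python
import numpy as np

def centroid_index(C1, C2):
    """Centroid index (CI): CI = 0 means the two solutions share the
    same cluster-level structure; each extra unit is one misplaced
    centroid."""
    def orphans(A, B):
        # nearest centroid of B for each row of A
        near = ((A[:, None] - B[None]) ** 2).sum(-1).argmin(1)
        return len(B) - len(set(near.tolist()))  # B centroids never hit
    return max(orphans(C1, C2), orphans(C2, C1))
```

On the synthetic datasets the paper uses CI = 0 against the ground-truth centroids as the success criterion, which is why it is a natural metric when the global optimum is essentially known.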
Dataset Splits | No | We split the data as evenly as possible, i.e. we create N mod J subsets of size ⌊N/J⌋ + 1 and J − (N mod J) subsets of size ⌊N/J⌋. This is easy to implement efficiently by just constructing a sorted list of indices, each index being repeated for the appropriate number of times, and shuffling it. Our preliminary testing showed that the algorithm is not sensitive to the details of the splitting procedure. We performed 100 tests for each dataset (30 for Urban GB and USCensus) and computed the average and minimum SSE cost achieved across the runs (the latter metric can provide an indication about what can be achieved with a multiple-restarts strategy by each algorithm), as well as the average running time, NDCs, and Lloyd's iterations.
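The splitting procedure described above can be written in a few lines. This sketch builds the sorted list of repeated subset indices and shuffles it, exactly as the text describes; the function name is illustrative, not the paper's API.

```python
import numpy as np

def split_assignments(N, J, seed=None):
    """Assign N points to J near-equal subsets: N mod J subsets of size
    floor(N/J) + 1, and J - (N mod J) subsets of size floor(N/J)."""
    rng = np.random.default_rng(seed)
    base, extra = divmod(N, J)
    sizes = [base + 1] * extra + [base] * (J - extra)
    a = np.repeat(np.arange(J), sizes)  # sorted index list, right multiplicities
    rng.shuffle(a)                      # randomize assignment, sizes preserved
    return a                            # a[i] = subset id of point i
```

Shuffling a fixed multiset of subset ids is O(N) and guarantees the subset sizes by construction, which is why it is "easy to implement efficiently".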
Hardware Specification | Yes | All the timings that we report refer to tests performed on the same hardware (Intel Core i7-9750H 2.60 GHz CPU with 6 physical cores, 64 GB DDR4 2666 MHz RAM, running Ubuntu Linux 20.04 with 5.15.0 kernel).
Software Dependencies | Yes | We used the same data structures and programming language (Julia v1.7.3) for all of them. For the local optimization part, i.e. Lloyd's algorithm, we implemented a number of techniques that can accelerate the computation, while keeping it exact, by skipping some updates at the cost of some bookkeeping: the reduced computation method (rc) by Kaukoranta et al. (1999); the Elkan method (elk) from Elkan (2003); the Hamerly method (ham) from Hamerly (2010); the Yinyang method (yy) from Ding et al. (2015); the exponion method (exp) from Newling & Fleuret (2016); the ball-kmeans method (ball) from Xia et al. (2022). For elk and yy we implemented the simplified versions described in Newling & Fleuret (2016). For each of the datasets that we tested we chose the accelerator technique that resulted in the fastest average convergence time, when using km++ for seeding, across at least 30 random repetitions, and used that accelerator for all other tests with that dataset.
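The accelerators listed above differ in bookkeeping, but the Elkan/Hamerly/exponion family shares one core idea: triangle-inequality pruning that skips distance computations while keeping the assignment exact. The single assignment pass below illustrates that shared idea in simplified form; it is not the paper's Julia implementation of any specific method.

```python
import numpy as np

def assign_with_bound(X, C):
    """Exact nearest-centroid assignment with triangle-inequality pruning:
    if d(x, c_best) <= 0.5 * d(c_best, c_j), then d(x, c_j) >= d(x, c_best)
    and the distance to c_j need not be computed."""
    cc = np.sqrt(((C[:, None] - C[None]) ** 2).sum(-1))  # centroid-centroid
    labels = np.empty(len(X), dtype=int)
    skipped = 0
    for i, x in enumerate(X):
        best = 0
        dbest = np.sqrt(((x - C[0]) ** 2).sum())
        for j in range(1, len(C)):
            if dbest <= 0.5 * cc[best, j]:  # prune: c_j cannot be closer
                skipped += 1
                continue
            d = np.sqrt(((x - C[j]) ** 2).sum())
            if d < dbest:
                best, dbest = j, d
        labels[i] = best
    return labels, skipped
```

The full methods add per-point upper and lower bounds maintained across Lloyd iterations, which is where the "bookkeeping" cost the text mentions comes from.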
Experiment Setup | Yes | The inputs of the procedure are the same as for any other algorithm (the dataset X and the number of clusters k), plus the subset-seeding algorithm init, and one extra parameter ρ, used to determine the number of subsets J. [...] For gkm++ we set s = 2 + log k. For ref(init) we normally set J = 10. For pnns(init) we used the value ρ = 1 throughout the main text. [...] In all our tests Lloyd's algorithm was run until convergence to a fixed point. Our code is available at https://github.com/carlobaldassi/KMeansPNNSmoothing.jl.