Faster Approximation Algorithms for k-Center via Data Reduction

Authors: Arnold Filtser, Shaofeng H.-C. Jiang, Yi Li, Anurag Murty Naredla, Ioannis Psarros, Qiaoyuan Yang, Qin Zhang

ICML 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Our experiments validate the performance of our coresets, with a focus on Theorem 1.1, since Theorem 1.1 leads to near-linear running time for k-CENTER when k = n^c (0 < c < 1), which is likely to be practical. Our experiments are conducted on four real-world datasets of various sizes and dimensions, and we evaluate the speedup of the well-known Gonzalez's algorithm (Gonzalez, 1985) on our coreset. The experiments show that our coreset provides a consistently better tradeoff between coreset size and clustering cost than the baselines, and runs 2 to 4 times faster than running Gonzalez's algorithm directly on the dataset, while still achieving comparable cost values.
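Gonzalez's algorithm, referenced above, is the classic greedy 2-approximation for k-center: repeatedly pick the point farthest from the centers chosen so far. A minimal sketch (the `dist` callback and point representation are illustrative choices, not from the paper's code):

```python
def gonzalez(points, k, dist):
    """Greedy 2-approximation for k-center (Gonzalez, 1985).

    Repeatedly adds the point farthest from the current centers.
    Returns the chosen centers and the clustering cost (covering radius).
    """
    centers = [points[0]]                      # arbitrary first center
    # d[i] = distance from points[i] to its nearest chosen center
    d = [dist(p, centers[0]) for p in points]
    for _ in range(k - 1):
        far = max(range(len(points)), key=lambda i: d[i])
        centers.append(points[far])
        for i, p in enumerate(points):
            d[i] = min(d[i], dist(p, centers[-1]))
    return centers, max(d)
```

Running this on a coreset instead of the full point set is exactly the speedup being measured in the experiments.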
Researcher Affiliation Academia (1) Computer Science Department, Bar-Ilan University; (2) School of Computer Science, Peking University; (3) College of Computing and Data Science, Nanyang Technological University; (4) Institut für Informatik, University of Bonn; (5) Archimedes, Athena Research Center; (6) Luddy School of Informatics, Computing, and Engineering, Indiana University Bloomington.
Pseudocode Yes
Algorithm 1: Covering based on consistent hashing
1: apx ← a γ-approximation of k-CENTER(P) using Lemma 2.2, where γ = poly(n)
2: t_β ← poly(d) · exp(O(d / β^{2/3})) (as in Lemma 3.2)
3: for i = 0 to log γ do
4:   τ ← apx
Algorithm 2: Covering based on sampling
1: apx ← a γ-approximation of k-CENTER(P) using Lemma 2.2, where γ = poly(n)
2: for i = 0 to log γ do
3:   τ ← apx
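Both pseudocode excerpts share the same outer structure: compute a coarse γ-approximation `apx`, then sweep over O(log γ) geometrically refined radius guesses τ. The excerpt is truncated after the assignment to τ, so the sketch below is an assumption about that pattern (a standard geometric search with τ = apx / 2^i), and `try_radius` is a hypothetical stand-in for the covering subroutine (consistent hashing in Algorithm 1, sampling in Algorithm 2):

```python
import math

def geometric_radius_search(apx, gamma, try_radius):
    """Sweep geometrically decreasing radius guesses tau = apx / 2^i.

    ASSUMPTION: the truncated pseudocode halves tau each iteration.
    `try_radius(tau)` is a placeholder for the covering step; it
    returns a cover for radius tau, or None if tau is too small.
    """
    best = None
    for i in range(int(math.log2(gamma)) + 1):
        tau = apx / (2 ** i)
        cover = try_radius(tau)
        if cover is None:
            break                  # tau too small; keep last success
        best = (tau, cover)
    return best
```

Since apx is a γ-approximation, the true optimum lies within a factor γ of it, so O(log γ) halvings suffice to land within a constant factor of the optimal radius.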
Open Source Code Yes The research code for this experiment is available in a GitHub repository.
Open Datasets Yes We use four real datasets: Kddcup (Stolfo et al., 1999), Covertype (Blackard, 1998), Census (Meek et al., 2001), and Fashion-MNIST (Xiao et al., 2017).
Dataset Splits No The paper mentions varying the 'target coreset size s' and evaluating 'the clustering cost on the full dataset,' but does not specify any training/test/validation splits for the datasets used in the experiments. It evaluates performance directly on the entire dataset or a reduced coreset.
Hardware Specification Yes All the experiments are run on a MacBook Air 15.3 with an Apple M3 chip (8 cores, 2.22 GHz), 16GB RAM, and macOS 14.4.1 (23E224).
Software Dependencies Yes All algorithms are implemented in C++ and compiled with Apple Clang version 15.0.0 at -O3 optimization level.
Experiment Setup Yes For all experiments, we set the number of centers k = √n, where n denotes the size of the dataset. For coreset baselines, we vary the target coreset size s (ranging from k up to 30k), compute each baseline coreset with the target size s, and evaluate the clustering cost on the full dataset. We report both the running time and the evaluated cost, averaged over 3 independent trials.
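The evaluation protocol described in this row (cluster a size-s summary, then score on the full dataset) can be sketched as follows. Uniform sampling is only a placeholder for the paper's coreset construction, `solver` stands in for a k-center routine such as Gonzalez's algorithm, and k = √n is a plausible reading of the setup:

```python
import math
import random
import time

def kcenter_cost(points, centers, dist):
    """Clustering cost = max distance from any point to its nearest center."""
    return max(min(dist(p, c) for c in centers) for p in points)

def run_trial(points, s, dist, solver):
    """Cluster a size-s subset but evaluate cost on the FULL dataset.

    ASSUMPTIONS: uniform sampling replaces the paper's coreset, and
    `solver(subset, k, dist)` returns k centers (e.g. Gonzalez).
    """
    k = math.isqrt(len(points))                # k = sqrt(n), per the setup
    subset = random.sample(points, min(s, len(points)))
    t0 = time.perf_counter()
    centers = solver(subset, k, dist)
    elapsed = time.perf_counter() - t0
    return kcenter_cost(points, centers, dist), elapsed
```

Averaging `(cost, elapsed)` over several seeded trials for each s in [k, 30k] reproduces the shape of the reported size-versus-cost tradeoff curves.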