Faster Approximation Algorithms for k-Center via Data Reduction
Authors: Arnold Filtser, Shaofeng H.-C. Jiang, Yi Li, Anurag Murty Naredla, Ioannis Psarros, Qiaoyuan Yang, Qin Zhang
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments validate the performance of our coresets, with a focus on Theorem 1.1, since Theorem 1.1 leads to near-linear running time for k-CENTER when k = n^c (0 < c < 1), which is likely to be practical. Our experiments are conducted on four real-world datasets of various sizes and dimensions, and we evaluate the speedup of the well-known Gonzalez's algorithm (Gonzalez, 1985) on our coreset. The experiments show that our coreset provides a consistently better tradeoff between the coreset size and clustering cost than baselines, and runs 2 to 4 times faster than directly running Gonzalez's algorithm on the dataset, while still achieving comparable cost values. |
| Researcher Affiliation | Academia | (1) Computer Science Department, Bar-Ilan University; (2) School of Computer Science, Peking University; (3) College of Computing and Data Science, Nanyang Technological University; (4) Institut für Informatik, University of Bonn; (5) Archimedes, Athena Research Center; (6) Luddy School of Informatics, Computing, and Engineering, Indiana University Bloomington. |
| Pseudocode | Yes | Algorithm 1 (Covering based on consistent hashing): 1: apx ← a γ-approximate of k-CENTER(P) using Lemma 2.2, where γ = poly(n); 2: t_β ← poly(d) · exp(O(d/β^{2/3})) (as in Lemma 3.2); 3: for i = 0 to log γ do; 4: τ ← apx … Algorithm 2 (Covering based on sampling): 1: apx ← a γ-approximate of k-CENTER(P) using Lemma 2.2, where γ = poly(n); 2: for i ← 0 to log γ do; 3: τ ← apx … |
| Open Source Code | Yes | The research code for this experiment is available at a GitHub repository. |
| Open Datasets | Yes | We use four real datasets: Kddcup (Stolfo et al., 1999), Covertype (Blackard, 1998), Census (Meek et al., 2001), and Fashion-MNIST (Xiao et al., 2017). |
| Dataset Splits | No | The paper mentions varying the 'target coreset size s' and evaluating 'the clustering cost on the full dataset,' but does not specify any training/test/validation splits for the datasets used in the experiments. It evaluates performance directly on the entire dataset or a reduced coreset. |
| Hardware Specification | Yes | All the experiments are run on a MacBook Air 15.3 with an Apple M3 chip (8 cores, 2.22 GHz), 16GB RAM, and macOS 14.4.1 (23E224). |
| Software Dependencies | Yes | All algorithms are implemented in C++ and compiled with Apple Clang version 15.0.0 at -O3 optimization level. |
| Experiment Setup | Yes | For all experiments, we set the number of centers k = √n, where n denotes the size of the dataset. For coreset baselines, we vary the target coreset size s (ranging from k up to 30k), compute each baseline coreset with the target size s, and evaluate the clustering cost on the full dataset. We report both the running time and the evaluated cost, averaged over 3 independent trials. |
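The table repeatedly refers to Gonzalez's algorithm (Gonzalez, 1985), the greedy farthest-point heuristic that gives a 2-approximation for metric k-center and is the baseline the coresets speed up. A minimal Python sketch of that classic algorithm (an illustration only, not the paper's C++ implementation or its coreset construction; the function name `gonzalez` and the Euclidean metric are assumptions for the example) could look like:

```python
import math

def gonzalez(points, k):
    """Greedy farthest-point clustering (Gonzalez, 1985).

    Starts from an arbitrary point, then repeatedly adds the point
    farthest from the current center set. Returns the k centers and
    the clustering cost (max distance of any point to its nearest
    center). The cost is at most twice the optimal k-center cost.
    """
    centers = [points[0]]
    # d[j] = distance from points[j] to its nearest chosen center
    d = [math.dist(p, centers[0]) for p in points]
    while len(centers) < k:
        # Pick the point farthest from all current centers
        far = max(range(len(points)), key=lambda j: d[j])
        centers.append(points[far])
        # Update nearest-center distances against the new center
        for j, p in enumerate(points):
            d[j] = min(d[j], math.dist(p, points[far]))
    return centers, max(d)

# Example: two well-separated groups, k = 2
pts = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (10, 11)]
centers, cost = gonzalez(pts, 2)
```

Each iteration scans all n points, so the total running time is O(nk); the paper's coreset approach first shrinks the input to s points and then runs this loop on the coreset, which is where the reported 2x to 4x speedup comes from.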