Faster Approximation Algorithms for k-Center via Data Reduction

Authors: Arnold Filtser, Shaofeng H.-C. Jiang, Yi Li, Anurag Murty Naredla, Ioannis Psarros, Qiaoyuan Yang, Qin Zhang

ICML 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Our experiments validate the performance of our coresets, with a focus on Theorem 1.1, since Theorem 1.1 leads to near-linear running time for k-CENTER when k = n^c (0 < c < 1), which is likely to be practical. Our experiments are conducted on four real-world datasets of various sizes and dimensions, and we evaluate the speedup of the well-known Gonzalez's algorithm (Gonzalez, 1985) on our coreset. The experiments show that our coreset provides a consistently better tradeoff between coreset size and clustering cost than the baselines, and runs 2 to 4 times faster than running Gonzalez's algorithm directly on the dataset, while still achieving comparable cost values.
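Gonzalez's algorithm, referenced above, is the classic greedy 2-approximation for k-center: repeatedly pick the point farthest from the centers chosen so far. A minimal sketch (the `dist` callback and point representation are illustrative choices, not from the paper's code):

```python
def gonzalez(points, k, dist):
    """Greedy 2-approximation for k-center (Gonzalez, 1985).

    Repeatedly adds the point farthest from the current centers.
    Returns the chosen centers and the clustering cost (covering radius).
    """
    centers = [points[0]]                      # arbitrary first center
    # d[i] = distance from points[i] to its nearest chosen center
    d = [dist(p, centers[0]) for p in points]
    for _ in range(k - 1):
        far = max(range(len(points)), key=lambda i: d[i])
        centers.append(points[far])
        for i, p in enumerate(points):
            d[i] = min(d[i], dist(p, centers[-1]))
    return centers, max(d)
```

Running this on a coreset instead of the full point set is exactly the speedup being measured in the experiments.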
Researcher Affiliation Academia (1) Computer Science Department, Bar-Ilan University; (2) School of Computer Science, Peking University; (3) College of Computing and Data Science, Nanyang Technological University; (4) Institut für Informatik, University of Bonn; (5) Archimedes, Athena Research Center; (6) Luddy School of Informatics, Computing, and Engineering, Indiana University Bloomington.
Pseudocode Yes
Algorithm 1: Covering based on consistent hashing
1: apx ← a γ-approximation of k-CENTER(P) using Lemma 2.2, where γ = poly(n)
2: t_β ← poly(d) · exp(O(d / β^{2/3})) (as in Lemma 3.2)
3: for i = 0 to log γ do
4:   τ ← apx
Algorithm 2: Covering based on sampling
1: apx ← a γ-approximation of k-CENTER(P) using Lemma 2.2, where γ = poly(n)
2: for i = 0 to log γ do
3:   τ ← apx
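Both pseudocode excerpts share the same outer structure: compute a coarse γ-approximation `apx`, then sweep over O(log γ) geometrically refined radius guesses τ. The excerpt is truncated after the assignment to τ, so the sketch below is an assumption about that pattern (a standard geometric search with τ = apx / 2^i), and `try_radius` is a hypothetical stand-in for the covering subroutine (consistent hashing in Algorithm 1, sampling in Algorithm 2):

```python
import math

def geometric_radius_search(apx, gamma, try_radius):
    """Sweep geometrically decreasing radius guesses tau = apx / 2^i.

    ASSUMPTION: the truncated pseudocode halves tau each iteration.
    `try_radius(tau)` is a placeholder for the covering step; it
    returns a cover for radius tau, or None if tau is too small.
    """
    best = None
    for i in range(int(math.log2(gamma)) + 1):
        tau = apx / (2 ** i)
        cover = try_radius(tau)
        if cover is None:
            break                  # tau too small; keep last success
        best = (tau, cover)
    return best
```

Since apx is a γ-approximation, the true optimum lies within a factor γ of it, so O(log γ) halvings suffice to land within a constant factor of the optimal radius.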
Open Source Code Yes The research code for this experiment is available in a GitHub repository.
Open Datasets Yes We use four real datasets: Kddcup (Stolfo et al., 1999), Covertype (Blackard, 1998), Census (Meek et al., 2001), and Fashion-MNIST (Xiao et al., 2017).
Dataset Splits No The paper mentions varying the 'target coreset size s' and evaluating 'the clustering cost on the full dataset,' but does not specify any training/test/validation splits for the datasets used in the experiments. It evaluates performance directly on the entire dataset or a reduced coreset.
Hardware Specification Yes All the experiments are run on a MacBook Air 15.3 with an Apple M3 chip (8 cores, 2.22 GHz), 16GB RAM, and macOS 14.4.1 (23E224).
Software Dependencies Yes All algorithms are implemented in C++ and compiled with Apple Clang version 15.0.0 at -O3 optimization level.
Experiment Setup Yes For all experiments, we set the number of centers k = √n, where n denotes the size of the dataset. For coreset baselines, we vary the target coreset size s (ranging from k up to 30k), compute each baseline coreset with the target size s, and evaluate the clustering cost on the full dataset. We report both the running time and the evaluated cost, averaged over 3 independent trials.
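The evaluation protocol described in this row (cluster a size-s summary, then score on the full dataset) can be sketched as follows. Uniform sampling is only a placeholder for the paper's coreset construction, `solver` stands in for a k-center routine such as Gonzalez's algorithm, and k = √n is a plausible reading of the setup:

```python
import math
import random
import time

def kcenter_cost(points, centers, dist):
    """Clustering cost = max distance from any point to its nearest center."""
    return max(min(dist(p, c) for c in centers) for p in points)

def run_trial(points, s, dist, solver):
    """Cluster a size-s subset but evaluate cost on the FULL dataset.

    ASSUMPTIONS: uniform sampling replaces the paper's coreset, and
    `solver(subset, k, dist)` returns k centers (e.g. Gonzalez).
    """
    k = math.isqrt(len(points))                # k = sqrt(n), per the setup
    subset = random.sample(points, min(s, len(points)))
    t0 = time.perf_counter()
    centers = solver(subset, k, dist)
    elapsed = time.perf_counter() - t0
    return kcenter_cost(points, centers, dist), elapsed
```

Averaging `(cost, elapsed)` over several seeded trials for each s in [k, 30k] reproduces the shape of the reported size-versus-cost tradeoff curves.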