New Algorithms for the Learning-Augmented k-means Problem
Authors: Junyu Huang, Qilong Feng, Ziyun Huang, Zhen Zhang, Jinhui Xu, Jianxin Wang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical experiments show that our proposed methods are faster than the state-of-the-art learning-augmented k-means algorithms with comparable performances on clustering quality. In this section, we give empirical evaluations on the performances of our proposed algorithms. |
| Researcher Affiliation | Academia | Junyu Huang School of Computer Science and Engineering, Central South University Xiangjiang Laboratory Changsha, China EMAIL; Qilong Feng School of Computer Science and Engineering, Central South University Xiangjiang Laboratory Changsha, China EMAIL; Ziyun Huang Department of Computer Science and Software Engineering, Penn State Erie, The Behrend College Erie, United States EMAIL; Zhen Zhang School of Advanced Interdisciplinary Studies, Hunan University of Technology and Business Xiangjiang Laboratory Changsha, China EMAIL; Jinhui Xu School of Information Science and Technology, University of Science and Technology of China Hefei, Anhui, China EMAIL; Jianxin Wang School of Computer Science and Engineering, Central South University Hunan Provincial Key Lab on Bioinformatics Xiangjiang Laboratory Changsha, China EMAIL |
| Pseudocode | Yes | Algorithm 1 Fast-Sampling; Algorithm 2 Fast-Estimation; Algorithm 3 Fast-Filtering; Algorithm 4 Fast-Filtering (modified); Algorithm 5 Fast-Sampling (k-median) |
| Open Source Code | No | The text does not contain any explicit statements about releasing code, a link to a code repository, or mention of code being available in supplementary materials for the methodology described in the paper. |
| Open Datasets | Yes | Following the work in Nguyen et al. (2022) and Ergun et al. (2021), we test the algorithms on datasets CIFAR10 (m = 10,000, d = 3,072), PHY (m = 10,000, d = 50) and MNIST (m = 1,797, d = 64) with varying error rate α and the number of clusters k. We also test the performances of the algorithms on other large datasets from the UCI Machine Learning Repository (https://archive.ics.uci.edu/), including SUSY (m = 5,000,000, d = 18) and HIGGS (m = 11,000,000, d = 27), and one large-scale dataset SIFT (m = 100,000,000, d = 128) from Matsui et al. (2017). |
| Dataset Splits | No | The paper describes how corrupted labeling partitions are generated for the predictors but does not provide specific details on training, validation, or test dataset splits for the main data points, such as percentages or sample counts. |
| Hardware Specification | Yes | All algorithms are implemented and executed in Python. The experiments were done on a machine with i7-12700KF processor and 256GB RAM. |
| Software Dependencies | No | The paper only mentions 'Python' without specifying its version or the versions of any other key software libraries or dependencies used for the implementation. |
| Experiment Setup | Yes | For the Fast-Sampling algorithm, the sample size is set to 4, and we fix ϵ = 1. For the Fast-Filtering and Fast-Estimation algorithms, we fix R1 = 10, R2 = m/20 and ϵ = 0.3, where m is the size of the given clustering instance. For each algorithm (including those in Ergun et al. (2021); Nguyen et al. (2022) as well as ours), we iterate over 15 possible values of α uniformly distributed in the interval [0.01, 0.5] as the inputs for the algorithms. |
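The experiment-setup row can be sketched in a few lines of Python. This is a hypothetical illustration of the reported configuration, not the authors' code: the function and parameter names (`run_algorithm`, `sample_size`, `R1`, `R2`) are placeholders, and only the numeric settings (sample size 4, ϵ = 1 for Fast-Sampling; R1 = 10, R2 = m/20, ϵ = 0.3 for Fast-Filtering/Fast-Estimation; 15 α values uniform in [0.01, 0.5]) come from the paper.

```python
# 15 candidate error rates, uniformly spaced over [0.01, 0.5],
# tried as inputs for every algorithm under comparison.
ALPHAS = [0.01 + i * (0.5 - 0.01) / 14 for i in range(15)]

# Fixed settings reported in the paper; key names are illustrative.
FAST_SAMPLING_PARAMS = {"sample_size": 4, "epsilon": 1.0}

def filtering_params(m):
    """Fixed settings shared by Fast-Filtering and Fast-Estimation,
    where m is the size of the clustering instance."""
    return {"R1": 10, "R2": m // 20, "epsilon": 0.3}

def sweep(run_algorithm, m):
    """Run a (hypothetical) algorithm once per candidate error rate."""
    params = filtering_params(m)
    return [run_algorithm(alpha=a, **params) for a in ALPHAS]
```

The grid over α mirrors how the comparison treats the error rate as an unknown tuning input rather than assuming the true corruption level is given.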