Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Fast and Accurate $k$-means++ via Rejection Sampling

Authors: Vincent Cohen-Addad, Silvio Lattanzi, Ashkan Norouzi-Fard, Christian Sohler, Ola Svensson

NeurIPS 2020 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section we empirically validate our theoretical results by comparing our algorithms FASTKMEANS++ and REJECTIONSAMPLING (see the details on how we set the parameters for LSH in the full version) with the following two baselines: K-MEANS++ algorithm: Perhaps the most commonly used algorithm in this field. It samples k points according to the D²-distribution. AFKMC2 algorithm: A recent result [4] based on random walks that improves the running time of the K-MEANS++ algorithm while maintaining a (weaker) theoretical guarantee on the solution quality.
Researcher Affiliation | Collaboration | Vincent Cohen-Addad (Google Research), Silvio Lattanzi (Google Research), Ashkan Norouzi-Fard (Google Research), Christian Sohler (University of Cologne), Ola Svensson (EPFL)
Pseudocode | Yes | Algorithm 1 MULTITREEOPEN, Algorithm 2 MULTITREESAMPLE, Algorithm 3 FASTK-MEANS++, Algorithm 4 REJECTIONSAMPLING
Open Source Code | No | The paper does not include an unambiguous statement that the authors are releasing their source code for the described methodology, nor does it provide a direct link to a code repository for their proposed algorithms.
Open Datasets | Yes | We ran our algorithms on three classic datasets from the UCI repository [15]: KDD-Cup [12] (311,029 points of dimension 74), Song [8] (515,345 points of dimension 90), and Census [19] (2,458,285 points of dimension 68).
Dataset Splits | No | The paper mentions using datasets but does not specify any explicit train/validation/test splits, percentages, or methodology for partitioning the data, nor does it reference predefined splits with citations.
Hardware Specification | No | The paper states 'The algorithms were run on a standard desktop computer' but provides no specific details regarding the CPU, GPU, memory, or any other hardware components used for the experiments.
Software Dependencies | No | The paper does not list specific software dependencies with their version numbers (e.g., Python, PyTorch, or specific solvers with versions) that would be needed to replicate the experiments.
Experiment Setup | No | The paper mentions using the AFKMC2 algorithm with a parameter 'm = 200', but it does not provide specific hyperparameter values or detailed training configurations for its own proposed algorithms (FASTK-MEANS++ and REJECTIONSAMPLING) within the main text.
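For context on the baseline named in the Research Type row: the K-MEANS++ algorithm samples k centers according to the D²-distribution, where each new center is chosen with probability proportional to its squared distance from the nearest center picked so far. The sketch below is a minimal illustration of that standard baseline only; it is not the paper's FASTKMEANS++ or REJECTIONSAMPLING algorithm, and the function name `kmeans_pp_seed` is our own.

```python
import numpy as np

def kmeans_pp_seed(points, k, rng=None):
    """Classic k-means++ seeding via D^2-sampling.

    points: (n, d) array; k: number of centers to pick.
    Illustrative sketch of the baseline, not the paper's accelerated method.
    """
    rng = np.random.default_rng(rng)
    n = len(points)
    # First center: chosen uniformly at random.
    centers = [points[rng.integers(n)]]
    # d2[i] = squared distance from point i to its nearest chosen center.
    d2 = np.sum((points - centers[0]) ** 2, axis=1)
    for _ in range(k - 1):
        # D^2-distribution: probability proportional to squared distance.
        probs = d2 / d2.sum()
        idx = rng.choice(n, p=probs)
        centers.append(points[idx])
        # Update nearest-center distances with the new center.
        d2 = np.minimum(d2, np.sum((points - points[idx]) ** 2, axis=1))
    return np.array(centers)
```

Each iteration costs O(nd), so full seeding is O(nkd); the paper's contribution is precisely to reduce this per-sample cost using multi-tree structures and rejection sampling.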