Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Fast and Accurate $k$-means++ via Rejection Sampling

Authors: Vincent Cohen-Addad, Silvio Lattanzi, Ashkan Norouzi-Fard, Christian Sohler, Ola Svensson

NeurIPS 2020 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section we empirically validate our theoretical results by comparing our algorithms FASTKMEANS++ and REJECTIONSAMPLING (see the details on how we set the parameters for LSH in the full version) with the following two baselines: K-MEANS++ algorithm: Perhaps the most commonly used algorithm in this field. It samples k points according to the D²-distribution. AFKMC2 algorithm: A recent result [4] based on random walks that improves the running time of the K-MEANS++ algorithm while maintaining a (weaker) theoretical guarantee on the solution quality.
Researcher Affiliation | Collaboration | Vincent Cohen-Addad (Google Research), Silvio Lattanzi (Google Research), Ashkan Norouzi-Fard (Google Research), Christian Sohler (University of Cologne), Ola Svensson (EPFL)
Pseudocode | Yes | Algorithm 1 MULTITREEOPEN, Algorithm 2 MULTITREESAMPLE, Algorithm 3 FASTK-MEANS++, Algorithm 4 REJECTIONSAMPLING
Open Source Code | No | The paper does not include an unambiguous statement that the authors are releasing their source code for the described methodology, nor does it provide a direct link to a code repository for their proposed algorithms.
Open Datasets | Yes | We ran our algorithms on three classic datasets from the UCI repository [15]: KDD-Cup [12] (311,029 points of dimension 74), Song [8] (515,345 points of dimension 90), and Census [19] (2,458,285 points of dimension 68).
Dataset Splits | No | The paper mentions using datasets but does not specify any explicit train/validation/test splits, percentages, or methodology for partitioning the data, nor does it reference predefined splits with citations.
Hardware Specification | No | The paper states 'The algorithms were run on a standard desktop computer' but provides no specific details regarding the CPU, GPU, memory, or any other hardware components used for the experiments.
Software Dependencies | No | The paper does not list specific software dependencies with their version numbers (e.g., Python, PyTorch, or specific solvers with versions) that would be needed to replicate the experiments.
Experiment Setup | No | The paper mentions using the AFKMC2 algorithm with a parameter 'm = 200', but it does not provide specific hyperparameter values or detailed training configurations for its own proposed algorithms (FASTK-MEANS++ and REJECTIONSAMPLING) within the main text.
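For context on the baseline named in the Research Type row: the K-MEANS++ algorithm samples k centers according to the D²-distribution, where each new center is chosen with probability proportional to its squared distance from the nearest center picked so far. The sketch below is a minimal illustration of that standard baseline only; it is not the paper's FASTKMEANS++ or REJECTIONSAMPLING algorithm, and the function name `kmeans_pp_seed` is our own.

```python
import numpy as np

def kmeans_pp_seed(points, k, rng=None):
    """Classic k-means++ seeding via D^2-sampling.

    points: (n, d) array; k: number of centers to pick.
    Illustrative sketch of the baseline, not the paper's accelerated method.
    """
    rng = np.random.default_rng(rng)
    n = len(points)
    # First center: chosen uniformly at random.
    centers = [points[rng.integers(n)]]
    # d2[i] = squared distance from point i to its nearest chosen center.
    d2 = np.sum((points - centers[0]) ** 2, axis=1)
    for _ in range(k - 1):
        # D^2-distribution: probability proportional to squared distance.
        probs = d2 / d2.sum()
        idx = rng.choice(n, p=probs)
        centers.append(points[idx])
        # Update nearest-center distances with the new center.
        d2 = np.minimum(d2, np.sum((points - points[idx]) ** 2, axis=1))
    return np.array(centers)
```

Each iteration costs O(nd), so full seeding is O(nkd); the paper's contribution is precisely to reduce this per-sample cost using multi-tree structures and rejection sampling.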