Training Gaussian Mixture Models at Scale via Coresets

Authors: Mario Lucic, Matthew Faulkner, Andreas Krause, Dan Feldman

JMLR 2017

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Empirical evaluation on several real-world data sets suggests that our coreset-based approach enables significant reduction in training time with negligible approximation error." Keywords: Gaussian mixture models, coresets, streaming and distributed computation.
Researcher Affiliation | Academia | Mario Lucic (EMAIL), Department of Computer Science, ETH Zurich, Universitätstrasse 6, 8092 Zürich, Switzerland; Matthew Faulkner (EMAIL), Department of Electrical Engineering and Computer Sciences, Caltech, 1200 E California Blvd, Pasadena, California 91125; Andreas Krause (EMAIL), Department of Computer Science, ETH Zurich, Universitätstrasse 6, 8092 Zürich, Switzerland; Dan Feldman (EMAIL), Department of Computer Science, University of Haifa, 199 Aba Khoushy Ave., Mount Carmel, Haifa, Israel.
Pseudocode | Yes | Algorithm 1 (Coreset), Algorithm 2 (K-Means++), Algorithm 3 (Adaptive sampling), Algorithm 4 (EM for GMMs), Algorithm 5 (Expectation), Algorithm 6 (Maximization).
Open Source Code | No | The paper includes a license for the paper itself but provides no concrete access information for source code: no repository link and no explicit statement that code for the described methodology was released.
Open Datasets | Yes | 1. Higgs: contains 11 000 000 instances describing signal processes which produce Higgs bosons and background processes which do not (Baldi et al., 2014). [...] 2. csn: contains 80 000 instances with 17 features extracted from acceleration data recorded from volunteers carrying and operating their phones in normal conditions (Faulkner et al., 2011).
Dataset Splits | Yes | "For each data set we use 80% of the data for training and the remaining 20% for computing the error."
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory amounts, or other machine specifications) used for running its experiments.
Software Dependencies | No | The paper does not provide specific software dependencies or version numbers (e.g., library or solver names with versions) needed to replicate the experiments.
Experiment Setup | Yes | EM iterations stop once the number of iterations exceeds 100 or the relative change in log-likelihood falls below 10^-3; prior thresholding is applied with λ = 0.001 (Section 4).
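The Pseudocode row lists a coreset construction built on K-Means++ seeding and adaptive (importance) sampling. A minimal sketch of that pattern is below: seeds are drawn by D²-sampling, each point is sampled with probability proportional to a crude sensitivity proxy (normalized squared distance to the nearest seed plus a uniform term), and samples are reweighted by the inverse sampling probability. This is not the paper's exact Algorithms 1-3; function names and the sensitivity bound are our simplifications.

```python
import numpy as np

def kmeanspp_seeds(X, k, rng):
    """D^2-sampling: each new seed is drawn with probability
    proportional to the squared distance to the nearest seed
    chosen so far (the classic k-means++ seeding step)."""
    n = X.shape[0]
    seeds = [X[rng.integers(n)]]
    d2 = np.full(n, np.inf)
    for _ in range(k - 1):
        d2 = np.minimum(d2, ((X - seeds[-1]) ** 2).sum(axis=1))
        seeds.append(X[rng.choice(n, p=d2 / d2.sum())])
    return np.array(seeds)

def coreset(X, k, m, rng):
    """Importance-sampling coreset sketch: sample m points with
    probability proportional to a sensitivity proxy, and weight
    each sampled point by 1 / (m * p) so weighted sums are
    unbiased estimates of sums over the full data set."""
    seeds = kmeanspp_seeds(X, k, rng)
    d2 = ((X[:, None, :] - seeds[None, :, :]) ** 2).sum(-1).min(1)
    s = d2 / d2.sum() + 1.0 / len(X)   # crude sensitivity proxy (our choice)
    p = s / s.sum()
    idx = rng.choice(len(X), size=m, p=p)
    return X[idx], 1.0 / (m * p[idx])
```

A GMM (or k-means) solver is then run on the m weighted points instead of all n, which is where the reported training-time reduction comes from.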
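The Experiment Setup row fully determines the EM stopping rule, which can be sketched as follows. This is a simplified spherical-Gaussian EM, assuming unweighted data; only the stopping criteria (at most 100 iterations, relative log-likelihood change below 10^-3) and the prior thresholding at λ = 0.001 are taken from the paper, and all names are ours.

```python
import numpy as np

def em_gmm(X, k, max_iter=100, rel_tol=1e-3, lam=0.001, seed=0):
    """EM for a spherical GMM with the stopping rule described in
    the paper: stop after max_iter iterations or when the relative
    change in log-likelihood drops below rel_tol; mixture weights
    are floored at lam ('prior thresholding') and renormalized."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    mu = X[rng.choice(n, k, replace=False)]      # init means at data points
    var = np.full(k, X.var())
    pi = np.full(k, 1.0 / k)
    prev_ll = -np.inf
    for _ in range(max_iter):
        # E-step: responsibilities under spherical Gaussians (log-sum-exp)
        d2 = ((X[:, None, :] - mu[None]) ** 2).sum(-1)
        logp = -0.5 * (d2 / var + d * np.log(2 * np.pi * var)) + np.log(pi)
        m = logp.max(1, keepdims=True)
        log_norm = m[:, 0] + np.log(np.exp(logp - m).sum(1))
        r = np.exp(logp - log_norm[:, None])
        ll = log_norm.sum()
        # Stopping rule: relative log-likelihood change below rel_tol
        if np.isfinite(prev_ll) and abs(ll - prev_ll) < rel_tol * abs(prev_ll):
            break
        prev_ll = ll
        # M-step with prior thresholding at lam
        nk = r.sum(0)
        pi = np.maximum(nk / n, lam)
        pi /= pi.sum()
        mu = (r.T @ X) / nk[:, None]
        d2 = ((X[:, None, :] - mu[None]) ** 2).sum(-1)
        var = (r * d2).sum(0) / (d * nk) + 1e-8  # small floor for stability
    return pi, mu, var, ll
```

The threshold on mixture weights keeps every component's prior at least λ, which prevents components from collapsing to zero weight during training.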