Training Gaussian Mixture Models at Scale via Coresets

Authors: Mario Lucic, Matthew Faulkner, Andreas Krause, Dan Feldman

JMLR 2017

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Empirical evaluation on several real-world data sets suggests that our coreset-based approach enables significant reduction in training time with negligible approximation error." Keywords: Gaussian mixture models, coresets, streaming and distributed computation.
Researcher Affiliation | Academia | Mario Lucic (EMAIL), Department of Computer Science, ETH Zurich, Universitätstrasse 6, 8092 Zürich, Switzerland; Matthew Faulkner (EMAIL), Department of Electrical Engineering and Computer Sciences, Caltech, 1200 E California Blvd, Pasadena, California 91125; Andreas Krause (EMAIL), Department of Computer Science, ETH Zurich, Universitätstrasse 6, 8092 Zürich, Switzerland; Dan Feldman (EMAIL), Department of Computer Science, University of Haifa, 199 Aba Khoushy Ave., Mount Carmel, Haifa, Israel.
Pseudocode | Yes | Algorithm 1 (Coreset), Algorithm 2 (K-Means++), Algorithm 3 (Adaptive sampling), Algorithm 4 (EM for GMMs), Algorithm 5 (Expectation), Algorithm 6 (Maximization).
Open Source Code | No | The paper includes a license for the paper itself but provides no concrete access information for source code: no repository link and no explicit statement that code for the described methodology was released.
Open Datasets | Yes | 1. Higgs: contains 11 000 000 instances describing signal processes which produce Higgs bosons and background processes which do not (Baldi et al., 2014). [...] 2. csn: contains 80 000 instances with 17 features extracted from acceleration data recorded from volunteers carrying and operating their phones in normal conditions (Faulkner et al., 2011).
Dataset Splits | Yes | "For each data set we use 80% of the data for training and the remaining 20% for computing the error."
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory amounts, or other machine specifications) used for running its experiments.
Software Dependencies | No | The paper does not provide specific software dependencies or version numbers (e.g., library or solver names with versions) needed to replicate the experiments.
Experiment Setup | Yes | EM iterations stop once the number of iterations exceeds 100 or the relative change in log-likelihood falls below 10^-3; prior thresholding is applied with λ = 0.001 (Section 4).
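The Pseudocode row lists a coreset construction built on K-Means++ seeding and adaptive (importance) sampling. A minimal sketch of that pattern is below: seeds are drawn by D²-sampling, each point is sampled with probability proportional to a crude sensitivity proxy (normalized squared distance to the nearest seed plus a uniform term), and samples are reweighted by the inverse sampling probability. This is not the paper's exact Algorithms 1-3; function names and the sensitivity bound are our simplifications.

```python
import numpy as np

def kmeanspp_seeds(X, k, rng):
    """D^2-sampling: each new seed is drawn with probability
    proportional to the squared distance to the nearest seed
    chosen so far (the classic k-means++ seeding step)."""
    n = X.shape[0]
    seeds = [X[rng.integers(n)]]
    d2 = np.full(n, np.inf)
    for _ in range(k - 1):
        d2 = np.minimum(d2, ((X - seeds[-1]) ** 2).sum(axis=1))
        seeds.append(X[rng.choice(n, p=d2 / d2.sum())])
    return np.array(seeds)

def coreset(X, k, m, rng):
    """Importance-sampling coreset sketch: sample m points with
    probability proportional to a sensitivity proxy, and weight
    each sampled point by 1 / (m * p) so weighted sums are
    unbiased estimates of sums over the full data set."""
    seeds = kmeanspp_seeds(X, k, rng)
    d2 = ((X[:, None, :] - seeds[None, :, :]) ** 2).sum(-1).min(1)
    s = d2 / d2.sum() + 1.0 / len(X)   # crude sensitivity proxy (our choice)
    p = s / s.sum()
    idx = rng.choice(len(X), size=m, p=p)
    return X[idx], 1.0 / (m * p[idx])
```

A GMM (or k-means) solver is then run on the m weighted points instead of all n, which is where the reported training-time reduction comes from.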
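The Experiment Setup row fully determines the EM stopping rule, which can be sketched as follows. This is a simplified spherical-Gaussian EM, assuming unweighted data; only the stopping criteria (at most 100 iterations, relative log-likelihood change below 10^-3) and the prior thresholding at λ = 0.001 are taken from the paper, and all names are ours.

```python
import numpy as np

def em_gmm(X, k, max_iter=100, rel_tol=1e-3, lam=0.001, seed=0):
    """EM for a spherical GMM with the stopping rule described in
    the paper: stop after max_iter iterations or when the relative
    change in log-likelihood drops below rel_tol; mixture weights
    are floored at lam ('prior thresholding') and renormalized."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    mu = X[rng.choice(n, k, replace=False)]      # init means at data points
    var = np.full(k, X.var())
    pi = np.full(k, 1.0 / k)
    prev_ll = -np.inf
    for _ in range(max_iter):
        # E-step: responsibilities under spherical Gaussians (log-sum-exp)
        d2 = ((X[:, None, :] - mu[None]) ** 2).sum(-1)
        logp = -0.5 * (d2 / var + d * np.log(2 * np.pi * var)) + np.log(pi)
        m = logp.max(1, keepdims=True)
        log_norm = m[:, 0] + np.log(np.exp(logp - m).sum(1))
        r = np.exp(logp - log_norm[:, None])
        ll = log_norm.sum()
        # Stopping rule: relative log-likelihood change below rel_tol
        if np.isfinite(prev_ll) and abs(ll - prev_ll) < rel_tol * abs(prev_ll):
            break
        prev_ll = ll
        # M-step with prior thresholding at lam
        nk = r.sum(0)
        pi = np.maximum(nk / n, lam)
        pi /= pi.sum()
        mu = (r.T @ X) / nk[:, None]
        d2 = ((X[:, None, :] - mu[None]) ** 2).sum(-1)
        var = (r * d2).sum(0) / (d * nk) + 1e-8  # small floor for stability
    return pi, mu, var, ll
```

The threshold on mixture weights keeps every component's prior at least λ, which prevents components from collapsing to zero weight during training.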