Training Gaussian Mixture Models at Scale via Coresets
Authors: Mario Lucic, Matthew Faulkner, Andreas Krause, Dan Feldman
JMLR 2017
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical evaluation on several real-world data sets suggests that our coreset-based approach enables significant reduction in training time with negligible approximation error. Keywords: Gaussian mixture models, coresets, streaming and distributed computation |
| Researcher Affiliation | Academia | Mario Lucic EMAIL Department of Computer Science ETH Zurich Universitätstrasse 6, 8092 Zürich, Switzerland. Matthew Faulkner EMAIL Department of Electrical Engineering and Computer Sciences Caltech 1200 E California Blvd, Pasadena, California 91125. Andreas Krause EMAIL Department of Computer Science ETH Zurich Universitätstrasse 6, 8092 Zürich, Switzerland. Dan Feldman EMAIL Department of Computer Science University of Haifa 199 Aba Khoushy Ave. Mount Carmel, Haifa, Israel. |
| Pseudocode | Yes | Algorithm 1 Coreset. Algorithm 2 K-Means++. Algorithm 3 Adaptive sampling. Algorithm 4 EM for GMMs. Algorithm 5 Expectation. Algorithm 6 Maximization. |
| Open Source Code | No | The paper includes a license for the paper itself, but does not provide any concrete access information (link to repository, explicit statement of code release for the methodology described) for source code. |
| Open Datasets | Yes | 1. Higgs. Contains 11 000 000 instances describing signal processes which produce Higgs bosons and background processes which do not (Baldi et al., 2014). [...] 2. csn. Contains 80 000 instances with 17 features extracted from acceleration data recorded from volunteers carrying and operating their phone in normal conditions (Faulkner et al., 2011). |
| Dataset Splits | Yes | For each data set we use 80% of the data for training and the remaining 20% for computing the error. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory amounts, or detailed computer specifications) used for running its experiments. |
| Software Dependencies | No | The paper does not provide specific software dependencies or version numbers (e.g., library or solver names with version numbers) needed to replicate the experiment. |
| Experiment Setup | Yes | We stop iterating between EM steps if the number of iterations exceeds 100, or the relative change in log-likelihood is smaller than 10^-3, and apply prior thresholding with λ = 0.001 (Section 4). |
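The "Pseudocode" row lists Algorithm 2, K-Means++, which seeds cluster centers by D² sampling. A minimal sketch of that standard seeding step is below; the function name, signature, and use of NumPy are our assumptions, not the paper's code.

```python
import numpy as np

def kmeans_pp_seeds(X, k, rng):
    """Sketch of standard k-means++ (D^2) seeding.

    Each new center is drawn with probability proportional to the
    squared distance from the nearest center chosen so far.
    (Illustrative only; not the paper's implementation.)
    """
    # First center: uniformly at random.
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Squared distance of every point to its nearest current center.
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        probs = d2 / d2.sum()
        centers.append(X[rng.choice(len(X), p=probs)])
    return np.array(centers)
```

With two well-separated clusters, the second sampled center lands in the cluster the first one missed, since same-cluster points have zero D² mass.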
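The "Dataset Splits" row states that 80% of each data set is used for training and 20% for computing the error. A minimal sketch of that protocol follows; the paper does not say whether the split is shuffled or seeded, so the permutation and seed here are assumptions.

```python
import numpy as np

def train_eval_split(X, train_frac=0.8, seed=0):
    """Hold out (1 - train_frac) of the rows for computing the error.

    The shuffle and fixed seed are illustrative assumptions; the paper
    only specifies the 80/20 proportions.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    cut = int(train_frac * len(X))
    return X[idx[:cut]], X[idx[cut:]]
```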
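The "Experiment Setup" row gives the EM stopping rule (more than 100 iterations, or relative log-likelihood change below 10^-3) and prior thresholding with λ = 0.001. A sketch of those two checks is below; the function names and the renormalization step in `threshold_priors` are our assumptions — "prior thresholding" is commonly read as clamping small mixture weights, but the paper's Section 4 is authoritative.

```python
import numpy as np

def should_stop(log_likelihoods, max_iter=100, rel_tol=1e-3):
    """Stop EM after max_iter iterations, or once the relative
    change in log-likelihood falls below rel_tol."""
    n = len(log_likelihoods)
    if n > max_iter:
        return True
    if n < 2:
        return False
    prev, curr = log_likelihoods[-2], log_likelihoods[-1]
    return abs(curr - prev) / abs(prev) < rel_tol

def threshold_priors(weights, lam=0.001):
    """Clamp mixture weights (priors) from below by lam, then
    renormalize. (One common reading of prior thresholding; the
    renormalization is an assumption.)"""
    w = np.maximum(np.asarray(weights, dtype=float), lam)
    return w / w.sum()
```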