Determinantal Point Processes for Coresets

Authors: Nicolas Tremblay, Simon Barthelmé, Pierre-Olivier Amblard

JMLR 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We apply our results to both the k-means and the linear regression problems, and give extensive empirical evidence that the small additional computational cost of DPP sampling comes with superior performance over its iid counterpart.
Researcher Affiliation | Academia | CNRS, Univ. Grenoble Alpes, Grenoble INP, GIPSA-lab, Grenoble, France
Pseudocode | Yes | Algorithm 1: The Gaussian kernel coreset sampling heuristics. Algorithm 2: The Vandermonde-based coreset sampling heuristics. Algorithm 3: Efficient J-DPP sampling algorithm with projective L-ensemble P = WW
Open Source Code | Yes | Finally, a Julia toolbox called DPP4Coresets is available on the authors' website. The DPP4Coresets toolbox is also available at https://gricad-gitlab.univ-grenoble-alpes.fr/tremblan/dpp4coresets.jl
Open Datasets | Yes | The MNIST data set (Le Cun, 1998). We also perform experiments on the 1990 US Census data set (downloaded from https://archive.ics.uci.edu/ml/datasets/US+Census+Data+(1990)).
Dataset Splits | No | The paper uses generated data (Stochastic Block Model graphs) or existing datasets (MNIST, US Census) for experiments to evaluate coreset properties and k-means performance, but does not provide specific training/test/validation splits for model training or evaluation. For example, for MNIST, it describes classifying digits but does not mention how the dataset was partitioned into train/test sets for this classification task.
Hardware Specification | Yes | Experiments were made on a laptop with 8 cores and 16 GB of memory, with the Julia toolbox available on the authors' website.
Software Dependencies | No | The paper mentions a "Julia toolbox called DPP4Coresets" and "DPP.jl" as well as the "Python toolbox DPPy". While these are specific software tools, no version numbers are provided for Julia, Python, or the toolboxes themselves, and version numbers are necessary for reproducibility.
Experiment Setup | Yes | To measure the performance of each method, we will empirically estimate the probability that, given the method's sampled weighted subset, it verifies the coreset property of Eq. (4) for a given randomly chosen θ (setting ϵ to 0.1). For m-DPP, several values of τ were tried, and we show here the result obtained for τ = 1.5. Also, a number r = 200 of Fourier features were used. For m-DPP, τ was set to 70 (the mean interdistance estimated on 1000 randomly chosen pairs of datapoints), and a number r = 30 of Fourier features was chosen.
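
The evaluation protocol quoted under Experiment Setup can be illustrated with a minimal Python sketch. Several parts are assumptions rather than the paper's method: Eq. (4) is not reproduced on this page, so the sketch assumes the usual multiplicative coreset guarantee, (1 - ϵ) L(θ) ≤ L̂(θ) ≤ (1 + ϵ) L(θ), with L the k-means cost; uniform iid sampling with weights n/m stands in for DPP sampling; θ is drawn as k random data points; and all function names are hypothetical. The last function shows one way a mean interdistance over randomly chosen pairs, as used to set τ, might be estimated.

```python
import math
import random


def kmeans_cost(points, centers, weights=None):
    """Weighted k-means cost: sum_i w_i * min_c ||x_i - c||^2."""
    if weights is None:
        weights = [1.0] * len(points)
    total = 0.0
    for x, w in zip(points, weights):
        total += w * min(sum((a - b) ** 2 for a, b in zip(x, c)) for c in centers)
    return total


def satisfies_coreset_property(points, subset, weights, centers, eps=0.1):
    """Assumed form of the property: the weighted subset cost lies within
    a factor (1 +/- eps) of the full-data cost for the given centers."""
    full = kmeans_cost(points, centers)
    estimate = kmeans_cost(subset, centers, weights)
    return (1 - eps) * full <= estimate <= (1 + eps) * full


def estimate_success_probability(points, m, k, eps=0.1, n_trials=200, seed=0):
    """Fraction of trials in which a freshly sampled weighted subset
    satisfies the property for a freshly drawn set of k centers."""
    rng = random.Random(seed)
    n = len(points)
    hits = 0
    for _ in range(n_trials):
        # Stand-in sampler (assumption): uniform iid draws, weights n / m.
        subset = [points[rng.randrange(n)] for _ in range(m)]
        weights = [n / m] * m
        # Assumption: a random theta is k distinct data points.
        centers = rng.sample(points, k)
        hits += satisfies_coreset_property(points, subset, weights, centers, eps)
    return hits / n_trials


def mean_interdistance(points, n_pairs=1000, seed=0):
    """Mean Euclidean distance over randomly chosen pairs of distinct points."""
    rng = random.Random(seed)
    n = len(points)
    total = 0.0
    for _ in range(n_pairs):
        i = rng.randrange(n)
        j = (i + rng.randrange(1, n)) % n  # offset in [1, n-1], so j != i
        total += math.dist(points[i], points[j])
    return total / n_pairs


if __name__ == "__main__":
    rng = random.Random(42)
    data = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(500)]
    print("mean interdistance:", mean_interdistance(data))
    print("success probability:", estimate_success_probability(data, m=100, k=3))
```

Replacing the stand-in sampler with a DPP-based one (for instance from the DPP4Coresets toolbox) and comparing the two estimated probabilities would mirror the comparison the paper describes, although the exact procedure used by the authors may differ.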