reproducibilityindex.ai

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Determinantal Point Processes for Coresets

Authors: Nicolas Tremblay, Simon Barthelmé, Pierre-Olivier Amblard

JMLR 2019

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We apply our results to both the k-means and the linear regression problems, and give extensive empirical evidence that the small additional computational cost of DPP sampling comes with superior performance over its iid counterpart.
Researcher Affiliation	Academia	CNRS, Univ. Grenoble Alpes, Grenoble INP, GIPSA-lab, Grenoble, France
Pseudocode	Yes	Algorithm 1: The Gaussian kernel coreset sampling heuristics. Algorithm 2: The Vandermonde-based coreset sampling heuristics. Algorithm 3: Efficient J-DPP sampling algorithm with projective L-ensemble P = WW
Open Source Code	Yes	Finally, a Julia toolbox called DPP4Coresets is available on the authors' website. The DPP4Coresets toolbox is also available at https://gricad-gitlab.univ-grenoble-alpes.fr/tremblan/dpp4coresets.jl.
Open Datasets	Yes	The MNIST data set (Le Cun, 1998). We also perform experiments on the 1990 US Census data set (downloaded from https://archive.ics.uci.edu/ml/datasets/US+Census+Data+(1990)).
Dataset Splits	No	The paper uses generated data (Stochastic Block Model graphs) or existing datasets (MNIST, US Census) for experiments to evaluate coreset properties and k-means performance, but does not provide specific training/test/validation splits for model training or evaluation. For example, for MNIST, it describes classifying digits but doesn't mention how the dataset was partitioned into train/test sets for this classification task.
Hardware Specification	Yes	Experiments were made on a laptop with 8 cores and 16 GB of memory, with the Julia toolbox available on the authors' website.
Software Dependencies	No	The paper mentions a "Julia toolbox called DPP4Coresets" and "DPP.jl" as well as "Python toolbox DPPy". While these are specific software tools, no version numbers for Julia, Python, or the toolboxes themselves are provided, which is necessary for reproducibility.
Experiment Setup	Yes	To measure the performance of each method, we will empirically estimate the probability that, given the method's sampled weighted subset, it verifies the coreset property of Eq. (4) for a given randomly chosen θ (setting ϵ to 0.1). For m-DPP, several values of τ were tried, and we show here the result obtained for τ = 1.5. Also, a number r = 200 of Fourier features were used. For m-DPP, τ was set to 70 (the mean interdistance estimated on 1000 randomly chosen pairs of datapoints), and a number r = 30 of Fourier features was chosen.
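
The Experiment Setup excerpt mentions two ingredients that are easy to illustrate: setting τ to the mean interdistance estimated on 1000 randomly chosen pairs of datapoints, and using r random Fourier features. The following is a minimal Python sketch of both, not the authors' Julia implementation. The function names are hypothetical, and the kernel parameterization exp(-||x - y||² / τ²) is an assumption, since the excerpt does not state the kernel formula.

```python
import math
import random


def mean_interdistance(points, n_pairs=1000, seed=0):
    """Estimate the mean Euclidean distance between distinct datapoints
    from n_pairs randomly chosen pairs (hypothetical helper)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_pairs):
        i, j = rng.sample(range(len(points)), 2)  # two distinct indices
        total += math.dist(points[i], points[j])
    return total / n_pairs


def random_fourier_features(points, tau, r, seed=0):
    """Map each point to r random Fourier features whose inner products
    approximate the Gaussian kernel exp(-||x - y||^2 / tau^2).
    The kernel parameterization is an assumption of this sketch."""
    rng = random.Random(seed)
    d = len(points[0])
    # For exp(-||x - y||^2 / tau^2), frequencies are N(0, 2 / tau^2) per coordinate.
    std = math.sqrt(2.0) / tau
    freqs = [[rng.gauss(0.0, std) for _ in range(d)] for _ in range(r)]
    phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(r)]
    scale = math.sqrt(2.0 / r)
    return [
        [
            scale * math.cos(sum(w_k * x_k for w_k, x_k in zip(w, x)) + b)
            for w, b in zip(freqs, phases)
        ]
        for x in points
    ]


if __name__ == "__main__":
    data_rng = random.Random(42)
    data = [(data_rng.random(), data_rng.random()) for _ in range(500)]
    tau = mean_interdistance(data, n_pairs=1000)
    features = random_fourier_features(data, tau=tau, r=30)
    print(f"tau = {tau:.3f}, feature matrix: {len(features)} x {len(features[0])}")
```

Larger values of r give a closer kernel approximation at a higher cost. The values quoted above (r = 200 and r = 30) are the ones reported in the excerpt and are not re-derived here.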