A determinantal point process for column subset selection

Authors: Ayoub Belhadji, Rémi Bardenet, Pierre Chainais

JMLR 2020 | Venue PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Numerical experiments suggest that our bounds are tight, and that our algorithms have comparable performance with the double phase algorithm, often considered the practical state-of-the-art. Keywords: column subset selection, determinantal point process, volume sampling, leverage score sampling, low-rank approximation
Researcher Affiliation Academia Ayoub Belhadji EMAIL R emi Bardenet EMAIL Pierre Chainais EMAIL Universit e de Lille, UMR 9189 CRISt AL, F-59000 Lille, France CNRS, UMR 9189, F59000 Lille, France Centrale Lille, F-59000 Lille, France
Pseudocode Yes Figure 1: Pseudocode for sampling from a DPP of marginal kernel K. Figure 3: The pseudocode of the algorithm generating a matrix X with prescribed profile of k-leverage scores. Figure 18: The pseudocode of the generator of random valid eigensteps taking as input (ℓ, σ).
Open Source Code Yes Finally, we packaged all CSS algorithms in this section in a publicly available Python toolbox5. 5. http://github.com/Ayoub Belhadji/CSSPy
Open Datasets Yes Colon genomics 62 2000 (Alon et al., 1999) Leukemia genomics 72 7129 (Golub et al., 1999) Basehock text processing 1993 4862 (Li et al., 2017b) Relathe text processing 1427 4322 (Li et al., 2017b)
Dataset Splits No The paper discusses the generation of toy datasets with specific properties and samples of subsets for analysis, but does not provide standard training/validation/test splits for the real datasets used in the experiments.
Hardware Specification No The paper does not provide any specific details about the hardware used to run the experiments, such as GPU/CPU models or memory.
Software Dependencies No The paper mentions 'Python toolbox' for its code and refers to 'DPPy toolbox' but does not specify any version numbers for Python or other key software libraries used for the experiments.
Experiment Setup Yes The number of observations is fixed to N = 100, the dimension to d = 20, and the number of selected columns to k {3, 5}. An ensemble of 50 subsets are sampled using each algorithm. We give the corresponding approximation errors yi X ˆw S 2, on Colon and Basehock respectively, for every value of k {10, 15, 20, 25, 30}. double phase with c = 10k OMP-mixcol, consists in outputting the columns S selected by OMP with the target vector z = Xc, where c N(0, Id). The second variant, called OMP-isotropic, consists in regressing z N(0, IN).