reproducibilityindex.ai

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

A determinantal point process for column subset selection

Authors: Ayoub Belhadji, Rémi Bardenet, Pierre Chainais

JMLR 2020 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Numerical experiments suggest that our bounds are tight, and that our algorithms have comparable performance with the double phase algorithm, often considered the practical state-of-the-art. Keywords: column subset selection, determinantal point process, volume sampling, leverage score sampling, low-rank approximation
Researcher Affiliation	Academia	Ayoub Belhadji EMAIL R emi Bardenet EMAIL Pierre Chainais EMAIL Universit e de Lille, UMR 9189 CRISt AL, F-59000 Lille, France CNRS, UMR 9189, F59000 Lille, France Centrale Lille, F-59000 Lille, France
Pseudocode	Yes	Figure 1: Pseudocode for sampling from a DPP of marginal kernel K. Figure 3: The pseudocode of the algorithm generating a matrix X with prescribed proﬁle of k-leverage scores. Figure 18: The pseudocode of the generator of random valid eigensteps taking as input (ℓ, σ).
Open Source Code	Yes	Finally, we packaged all CSS algorithms in this section in a publicly available Python toolbox5. 5. http://github.com/Ayoub Belhadji/CSSPy
Open Datasets	Yes	Colon genomics 62 2000 (Alon et al., 1999) Leukemia genomics 72 7129 (Golub et al., 1999) Basehock text processing 1993 4862 (Li et al., 2017b) Relathe text processing 1427 4322 (Li et al., 2017b)
Dataset Splits	No	The paper discusses the generation of toy datasets with specific properties and samples of subsets for analysis, but does not provide standard training/validation/test splits for the real datasets used in the experiments.
Hardware Specification	No	The paper does not provide any specific details about the hardware used to run the experiments, such as GPU/CPU models or memory.
Software Dependencies	No	The paper mentions 'Python toolbox' for its code and refers to 'DPPy toolbox' but does not specify any version numbers for Python or other key software libraries used for the experiments.
Experiment Setup	Yes	The number of observations is ﬁxed to N = 100, the dimension to d = 20, and the number of selected columns to k {3, 5}. An ensemble of 50 subsets are sampled using each algorithm. We give the corresponding approximation errors yi X ˆw S 2, on Colon and Basehock respectively, for every value of k {10, 15, 20, 25, 30}. double phase with c = 10k OMP-mixcol, consists in outputting the columns S selected by OMP with the target vector z = Xc, where c N(0, Id). The second variant, called OMP-isotropic, consists in regressing z N(0, IN).