Scalable Gaussian-process regression and variable selection using Vecchia approximations

Authors: Jian Cao, Joseph Guinness, Marc G. Genton, Matthias Katzfuss

JMLR 2022

Reproducibility Variable Result LLM Response
Research Type: Experimental. Theoretical analysis and numerical studies demonstrate the improved scalability and accuracy relative to existing methods. In Section 4, we compare VGPR with state-of-the-art GP regressions in terms of posterior inference and variable selection based on simulated GP datasets. Section 5 provides a comparison with methods commonly used in machine learning for variable selection and prediction based on real datasets, including an example with n = 10^6 and d = 10^3.
Researcher Affiliation: Academia. Jian Cao (EMAIL), Department of Statistics and Institute of Data Science, Texas A&M University, College Station, TX 77843, USA; Joseph Guinness (EMAIL), Department of Statistics, Cornell University, Ithaca, NY 14853, USA; Marc G. Genton (EMAIL), Statistics Program, King Abdullah University of Science and Technology, Thuwal 23955-6900, Saudi Arabia; Matthias Katzfuss (EMAIL), Department of Statistics, Texas A&M University, College Station, TX 77843, USA.
Pseudocode: Yes. Algorithm 1: VGPR; Algorithm 2: Forward-backward selection; Algorithm 3: Quadratic constrained coordinate descent (QCCD); Algorithm 4: Constrained coordinate descent (CCD).
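The constrained coordinate descent named in Algorithm 4 is not reproduced in this record, but the general technique can be sketched on a toy problem. The function below is purely illustrative (the name, the quadratic objective, and the nonnegativity constraint are assumptions, not the paper's exact subproblem): it cycles through coordinates, solves each one-dimensional quadratic in closed form, and clips at the constraint boundary.

```python
import numpy as np

def ccd_nonneg_quadratic(A, b, n_iter=100):
    # Illustrative constrained coordinate descent: minimize
    # 0.5 * x^T A x - b^T x subject to x >= 0, for SPD A.
    x = np.zeros(len(b))
    for _ in range(n_iter):
        for i in range(len(b)):
            # Gradient contribution from the other coordinates only.
            g = A[i] @ x - A[i, i] * x[i] - b[i]
            # Closed-form 1-D minimizer, projected onto the constraint.
            x[i] = max(0.0, -g / A[i, i])
    return x
```

For a diagonal system the loop converges in one sweep; for general SPD matrices it behaves like projected Gauss-Seidel.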
Open Source Code: Yes. The code for replicating the numerical results in this paper is published at https://github.com/katzfuss-group/Vecchia_GPR_var_select.
Open Datasets: Yes. The second dataset was the relative location of CT slices on the axial axis (Slice) from the UCI Machine Learning Repository (Dua and Graff, 2017), with n = 53,500 images and d = 386 features. The third dataset was the physicochemical properties of protein tertiary structure (CASP) dataset, also from the UCI Repository, with n = 45,730 responses and d = 9 features. A fourth dataset was the temperature (Temp) data used in Garnett et al. (2013), which contains 7,117 training samples and 3,558 testing samples, each with d = 106 features.
Dataset Splits: Yes. 5,000 responses were set aside as the testing dataset used to evaluate the four methods' performance. A quarter of the responses were set aside to compute the OOS RMSE, based on which the stopping condition was defined: stop once the OOS RMSE improves by less than 1% after any new covariate is selected. The OOS sample size of n/4 and the 1% OOS score threshold are used throughout this paper and are generally recommended as default values. The second dataset was the relative location of CT slices on the axial axis (Slice) from the UCI Machine Learning Repository (Dua and Graff, 2017), with n = 53,500 images and d = 386 features. The images belong to 74 individuals, a quarter of whom were selected as the testing dataset. A fourth dataset was the temperature (Temp) data used in Garnett et al. (2013), which contains 7,117 training samples and 3,558 testing samples, each with d = 106 features.
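The split-and-stop rule described above (hold out a quarter of the responses as an OOS set, stop once the best new covariate improves the OOS RMSE by less than 1%) can be sketched as a generic forward-selection loop. This is a minimal sketch, not the paper's VGPR algorithm: `forward_select` and `fit_and_score` are hypothetical names, and the ordinary least-squares fit in the usage example stands in for the paper's GP fitting.

```python
import numpy as np

def forward_select(fit_and_score, d, tol=0.01):
    # Greedy forward selection: add the covariate that most reduces the
    # out-of-sample RMSE; stop when the best candidate improves it by < tol.
    selected, best_rmse = [], np.inf
    while len(selected) < d:
        remaining = [j for j in range(d) if j not in selected]
        scores = {j: fit_and_score(selected + [j]) for j in remaining}
        j_best = min(scores, key=scores.get)
        if scores[j_best] > best_rmse * (1 - tol):
            break  # improvement under the 1% threshold: stop
        best_rmse = scores[j_best]
        selected.append(j_best)
    return selected, best_rmse

# Usage sketch on synthetic data, with a quarter of responses held out as OOS.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 6))
y = 2.0 * X[:, 0] + X[:, 1] + 0.1 * rng.normal(size=400)
n_train = 300  # remaining 100 responses (n/4) form the OOS set

def fit_and_score(cols):
    # Least-squares stand-in for GP fitting; returns OOS RMSE.
    beta, *_ = np.linalg.lstsq(X[:n_train][:, cols], y[:n_train], rcond=None)
    resid = y[n_train:] - X[n_train:][:, cols] @ beta
    return float(np.sqrt(np.mean(resid ** 2)))

sel, rmse = forward_select(fit_and_score, d=6)
```

With this seed the two signal covariates are picked first, and the loop stops once the remaining noise covariates fail the 1% improvement test.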
Hardware Specification: Yes. The computation times were measured on an Intel Xeon E5-2680 v4 CPU using 56 cores and capped at a 10-hour limit for each GP replicate.
Software Dependencies: No. For Tree and Lasso, the default setups from the glmnet R package (Friedman et al., 2010) and the sklearn Python module (Pedregosa et al., 2011) were used, respectively. [...] refer to Guinness (2021) and the R package GpGp (Guinness, 2018) for the computation details. [...] Based on the GPyTorch (Gardner et al., 2018) implementation, when d > 5, the kernel function needed to assume an additive structure to be computationally feasible, for which ARD kernels are not yet available; hence we only consider KISS as a state-of-the-art competitor for prediction at unknown locations. [...] For very large n, it is also possible to reduce the cost of MM and NN by replacing them with random ordering and the index-based-on-inverted-file (IVF) method (implemented in the Faiss library of Johnson et al., 2017), respectively. The text mentions software by name and provides citations, but no specific version numbers are given for any of them, which are required for reproducibility.
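The maximum-minimum-distance (MM) ordering and nearest-neighbor (NN) conditioning sets mentioned above are the ingredients of Vecchia approximations. A brute-force NumPy sketch follows; it is O(n^2) and purely illustrative (function names are assumptions), whereas the large-n settings in the paper would use the random-ordering and Faiss IVF shortcuts instead.

```python
import numpy as np

def maxmin_order(locs):
    # Greedy max-min (MM) ordering: start near the centroid, then repeatedly
    # pick the point farthest from all points ordered so far.
    n = locs.shape[0]
    start = int(np.argmin(np.linalg.norm(locs - locs.mean(axis=0), axis=1)))
    order = [start]
    mindist = np.linalg.norm(locs - locs[start], axis=1)
    for _ in range(n - 1):
        nxt = int(np.argmax(mindist))
        order.append(nxt)
        mindist = np.minimum(mindist, np.linalg.norm(locs - locs[nxt], axis=1))
    return np.array(order)

def nn_conditioning_sets(locs, order, m):
    # For each point in the ordering, the (up to) m nearest neighbors among
    # previously ordered points; indices refer to positions in the ordering.
    ordered = locs[order]
    sets = [np.array([], dtype=int)]
    for i in range(1, len(order)):
        d = np.linalg.norm(ordered[:i] - ordered[i], axis=1)
        sets.append(np.argsort(d)[:m])
    return sets
```

Each point then conditions only on its set of at most m earlier neighbors, which is what makes the Vecchia likelihood factorization cheap.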
Experiment Setup: Yes. The initial values for σ², {r_l}_{l=0}^d, and τ², when needed, were 0.25, 0.1, and 10^-4, respectively, while for PGPR, ten random initial values, as recommended in Yi et al. (2011), were used for the optimization at each λ. The maximum numbers of iterations were 100 for PGPR, Fisher, and FWD, and 200 for VGPR, as the latter used mini-batch subsampling with ň = 128 ≪ n. The Armijo constant c in Algorithm 3 heuristically prevents overly large steps; we choose c = 10^-4 as recommended in Chapter 3 of Wright et al. (1999) and used in the GpGp R package (Guinness, 2021). The number of new covariates selected in each iteration (k) and the penalty parameters κ and γ in (8) are unique to our proposed VGPR algorithm and iterative adaptive bridge penalty, and hence have not been discussed in the existing literature. Here, we provide some recommendations and a sensitivity analysis on them; see Appendix B for more details. Larger k leads to higher optimization efficiency but also a higher risk of local optima; we recommend a value between 3 and 5. Bigger κ corresponds to weaker numerical singularity at r_l = 0. We recommend κ > 0 when mini-batch subsampling is applied and a large κ (e.g., 10 or 15) when the GP with ARD kernels is likely a misspecified model. Smaller γ causes a larger difference in the penalty derivatives at small and large r_l. Yi et al. (2011) used γ = 0.01, whereas we recommend a choice between 0.1 and 0.25 for a smoother objective function.
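The Armijo constant c = 10^-4 quoted above is the standard sufficient-decrease parameter in a backtracking line search. A minimal sketch of that test follows (this is the textbook backtracking scheme, not the paper's QCCD implementation; the function name is an assumption):

```python
import numpy as np

def armijo_backtrack(f, grad, x, direction, c=1e-4, shrink=0.5, max_iter=50):
    # Shrink the step length t until the Armijo sufficient-decrease condition
    #   f(x + t*d) <= f(x) + c * t * grad(x)^T d
    # holds; c = 1e-4 accepts almost any step that decreases f at all.
    fx = f(x)
    slope = grad(x) @ direction  # directional derivative, negative for descent
    t = 1.0
    for _ in range(max_iter):
        if f(x + t * direction) <= fx + c * t * slope:
            return t
        t *= shrink
    return t
```

Because c is tiny, the condition mainly guards against steps so large that the objective increases, which matches the "heuristically prevents overly large steps" role described above.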