Scalable Gaussian-process regression and variable selection using Vecchia approximations

Authors: Jian Cao, Joseph Guinness, Marc G. Genton, Matthias Katzfuss

JMLR 2022

Reproducibility Variable Result LLM Response
Research Type: Experimental. Theoretical analysis and numerical studies demonstrate the improved scalability and accuracy relative to existing methods. In Section 4, we compare VGPR with state-of-the-art GP regressions in terms of posterior inference and variable selection based on simulated GP datasets. Section 5 provides a comparison with methods commonly used in machine learning for variable selection and prediction based on real datasets, including an example with n = 10^6 and d = 10^3.
Researcher Affiliation: Academia. Jian Cao (EMAIL), Department of Statistics and Institute of Data Science, Texas A&M University, College Station, TX 77843, USA; Joseph Guinness (EMAIL), Department of Statistics, Cornell University, Ithaca, NY 14853, USA; Marc G. Genton (EMAIL), Statistics Program, King Abdullah University of Science and Technology, Thuwal 23955-6900, Saudi Arabia; Matthias Katzfuss (EMAIL), Department of Statistics, Texas A&M University, College Station, TX 77843, USA.
Pseudocode: Yes. Algorithm 1: VGPR; Algorithm 2: Forward-backward selection; Algorithm 3: Quadratic constrained coordinate descent (QCCD); Algorithm 4: Constrained coordinate descent (CCD).
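The constrained coordinate descent named in Algorithm 4 is not reproduced in this record, but the general technique can be sketched on a toy problem. The function below is purely illustrative (the name, the quadratic objective, and the nonnegativity constraint are assumptions, not the paper's exact subproblem): it cycles through coordinates, solves each one-dimensional quadratic in closed form, and clips at the constraint boundary.

```python
import numpy as np

def ccd_nonneg_quadratic(A, b, n_iter=100):
    # Illustrative constrained coordinate descent: minimize
    # 0.5 * x^T A x - b^T x subject to x >= 0, for SPD A.
    x = np.zeros(len(b))
    for _ in range(n_iter):
        for i in range(len(b)):
            # Gradient contribution from the other coordinates only.
            g = A[i] @ x - A[i, i] * x[i] - b[i]
            # Closed-form 1-D minimizer, projected onto the constraint.
            x[i] = max(0.0, -g / A[i, i])
    return x
```

For a diagonal system the loop converges in one sweep; for general SPD matrices it behaves like projected Gauss-Seidel.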
Open Source Code: Yes. The code for replicating the numerical results in this paper is published at https://github.com/katzfuss-group/Vecchia_GPR_var_select.
Open Datasets: Yes. The second dataset was the relative location of CT slices on the axial axis (Slice) from the UCI Machine Learning Repository (Dua and Graff, 2017), with n = 53,500 images and d = 386 features. The third dataset was the physicochemical properties of protein tertiary structure (CASP) dataset, also from the UCI Repository, with n = 45,730 responses and d = 9 features. A fourth dataset was the temperature (Temp) data used in Garnett et al. (2013), which contains 7,117 training samples and 3,558 testing samples, each with d = 106 features.
Dataset Splits: Yes. 5,000 responses were set aside as the testing dataset used to evaluate the four methods' performance. A quarter of the responses were set aside to compute the OOS RMSE, based on which the stopping condition was defined: stop once the OOS RMSE improves by less than 1% after any new covariate is selected. The OOS sample size of n/4 and the 1% OOS score threshold are used throughout this paper and are generally recommended as default values. The second dataset was the relative location of CT slices on the axial axis (Slice) from the UCI Machine Learning Repository (Dua and Graff, 2017), with n = 53,500 images and d = 386 features. The images belong to 74 individuals, a quarter of whom were selected as the testing dataset. A fourth dataset was the temperature (Temp) data used in Garnett et al. (2013), which contains 7,117 training samples and 3,558 testing samples, each with d = 106 features.
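The split-and-stop rule described above (hold out a quarter of the responses as an OOS set, stop once the best new covariate improves the OOS RMSE by less than 1%) can be sketched as a generic forward-selection loop. This is a minimal sketch, not the paper's VGPR algorithm: `forward_select` and `fit_and_score` are hypothetical names, and the ordinary least-squares fit in the usage example stands in for the paper's GP fitting.

```python
import numpy as np

def forward_select(fit_and_score, d, tol=0.01):
    # Greedy forward selection: add the covariate that most reduces the
    # out-of-sample RMSE; stop when the best candidate improves it by < tol.
    selected, best_rmse = [], np.inf
    while len(selected) < d:
        remaining = [j for j in range(d) if j not in selected]
        scores = {j: fit_and_score(selected + [j]) for j in remaining}
        j_best = min(scores, key=scores.get)
        if scores[j_best] > best_rmse * (1 - tol):
            break  # improvement under the 1% threshold: stop
        best_rmse = scores[j_best]
        selected.append(j_best)
    return selected, best_rmse

# Usage sketch on synthetic data, with a quarter of responses held out as OOS.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 6))
y = 2.0 * X[:, 0] + X[:, 1] + 0.1 * rng.normal(size=400)
n_train = 300  # remaining 100 responses (n/4) form the OOS set

def fit_and_score(cols):
    # Least-squares stand-in for GP fitting; returns OOS RMSE.
    beta, *_ = np.linalg.lstsq(X[:n_train][:, cols], y[:n_train], rcond=None)
    resid = y[n_train:] - X[n_train:][:, cols] @ beta
    return float(np.sqrt(np.mean(resid ** 2)))

sel, rmse = forward_select(fit_and_score, d=6)
```

With this seed the two signal covariates are picked first, and the loop stops once the remaining noise covariates fail the 1% improvement test.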
Hardware Specification: Yes. The computation times were measured on an Intel Xeon E5-2680 v4 CPU using 56 cores and capped at a 10-hour limit for each GP replicate.
Software Dependencies: No. For Tree and Lasso, the default setups from the glmnet R package (Friedman et al., 2010) and the sklearn Python module (Pedregosa et al., 2011) were used, respectively. [...] refer to Guinness (2021) and the R package GpGp (Guinness, 2018) for the computation details. [...] Based on the GPyTorch (Gardner et al., 2018) implementation, when d > 5, the kernel function needed to assume an additive structure to be computationally feasible, for which ARD kernels are not yet available; hence we only consider KISS as a state-of-the-art competitor for prediction at unknown locations. [...] For very large n, it is also possible to reduce the cost of MM and NN by replacing them with random ordering and the index-based-on-inverted-file (IVF) method (implemented in the Faiss library of Johnson et al., 2017), respectively. The text mentions software by name and provides citations, but no specific version numbers are given for any of them, which are required for reproducibility.
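The maximum-minimum-distance (MM) ordering and nearest-neighbor (NN) conditioning sets mentioned above are the ingredients of Vecchia approximations. A brute-force NumPy sketch follows; it is O(n^2) and purely illustrative (function names are assumptions), whereas the large-n settings in the paper would use the random-ordering and Faiss IVF shortcuts instead.

```python
import numpy as np

def maxmin_order(locs):
    # Greedy max-min (MM) ordering: start near the centroid, then repeatedly
    # pick the point farthest from all points ordered so far.
    n = locs.shape[0]
    start = int(np.argmin(np.linalg.norm(locs - locs.mean(axis=0), axis=1)))
    order = [start]
    mindist = np.linalg.norm(locs - locs[start], axis=1)
    for _ in range(n - 1):
        nxt = int(np.argmax(mindist))
        order.append(nxt)
        mindist = np.minimum(mindist, np.linalg.norm(locs - locs[nxt], axis=1))
    return np.array(order)

def nn_conditioning_sets(locs, order, m):
    # For each point in the ordering, the (up to) m nearest neighbors among
    # previously ordered points; indices refer to positions in the ordering.
    ordered = locs[order]
    sets = [np.array([], dtype=int)]
    for i in range(1, len(order)):
        d = np.linalg.norm(ordered[:i] - ordered[i], axis=1)
        sets.append(np.argsort(d)[:m])
    return sets
```

Each point then conditions only on its set of at most m earlier neighbors, which is what makes the Vecchia likelihood factorization cheap.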
Experiment Setup: Yes. The initial values for σ², {r_l}_{l=0}^d, and τ², when needed, were 0.25, 0.1, and 10^-4, respectively, while for PGPR, ten random initial values, as recommended in Yi et al. (2011), were used for the optimization at each λ. The maximum numbers of iterations were 100 for PGPR, Fisher, and FWD, and 200 for VGPR, as the latter used mini-batch subsampling with ň = 128 ≪ n. The Armijo constant c in Algorithm 3 heuristically prevents overly large steps; we choose c = 10^-4 as recommended in Chapter 3 of Wright et al. (1999) and used in the GpGp R package (Guinness, 2021). The number of new covariates selected in each iteration (k) and the penalty parameters κ and γ in (8) are unique to our proposed VGPR algorithm and iterative adaptive bridge penalty, and hence have not been discussed in the existing literature. Here, we provide some recommendations and a sensitivity analysis on them; see Appendix B for more details. Larger k leads to higher optimization efficiency but also a higher risk of local optima; we recommend a value between 3 and 5. Bigger κ corresponds to weaker numerical singularity at r_l = 0. We recommend κ > 0 when mini-batch subsampling is applied and a large κ (e.g., 10 or 15) when the GP with ARD kernels is likely a misspecified model. Smaller γ causes a larger difference in the penalty derivatives at small and large r_l. Yi et al. (2011) used γ = 0.01, whereas we recommend a choice between 0.1 and 0.25 for a smoother objective function.
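The Armijo constant c = 10^-4 quoted above is the standard sufficient-decrease parameter in a backtracking line search. A minimal sketch of that test follows (this is the textbook backtracking scheme, not the paper's QCCD implementation; the function name is an assumption):

```python
import numpy as np

def armijo_backtrack(f, grad, x, direction, c=1e-4, shrink=0.5, max_iter=50):
    # Shrink the step length t until the Armijo sufficient-decrease condition
    #   f(x + t*d) <= f(x) + c * t * grad(x)^T d
    # holds; c = 1e-4 accepts almost any step that decreases f at all.
    fx = f(x)
    slope = grad(x) @ direction  # directional derivative, negative for descent
    t = 1.0
    for _ in range(max_iter):
        if f(x + t * direction) <= fx + c * t * slope:
            return t
        t *= shrink
    return t
```

Because c is tiny, the condition mainly guards against steps so large that the objective increases, which matches the "heuristically prevents overly large steps" role described above.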