Gaussian Process Boosting

Authors: Fabio Sigrist

JMLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We obtain increased prediction accuracy compared to existing approaches on multiple simulated and real-world data sets. Keywords: non-linear mixed effects models, mixed effects machine learning, grouped random effects, longitudinal data, spatial and spatio-temporal data, tree-boosting with high-cardinality categorical variables
Researcher Affiliation | Academia | Fabio Sigrist, EMAIL, Lucerne University of Applied Sciences and Arts, Suurstoffi 1, 6343 Rotkreuz, Switzerland
Pseudocode | Yes | Algorithm 1: GPBoost: Gaussian Process Boosting... Algorithm 2: GPBoost OOS: Gaussian Process Boosting with Out-Of-Sample covariance parameter estimation
Open Source Code | Yes | The GPBoost algorithm is implemented in the GPBoost library written in C++ with a C application programming interface (API) and corresponding Python and R packages. See https://github.com/fabsig/GPBoost for more information.
Open Datasets | Yes | We use panel data from the National Longitudinal Survey of Young Working Women... it can be downloaded from https://www.stata-press.com/data/r10/nlswork.dta. ... This data is available in the spData R package (Bivand et al., 2008)... We compare the GPBoost algorithm to independent boosting and Gaussian process regression using several benchmark data sets from the UCI data set repository (http://archive.ics.uci.edu/ml/index.php).
Dataset Splits | Yes | We simulate 100 times both a training data set of size n and two different test data sets, each also of size n. ... We compare the prediction accuracy of different approaches using nested 4-fold cross-validation. Specifically, all observations are partitioned into four disjoint sets, and, in every fold, one of the sets is used as test data and the remaining data is used for training. ... Prediction accuracy is evaluated by partitioning the data into expanding-window training data sets and temporal out-of-sample test data sets. Specifically, learning is done on an expanding window containing all data up to the year t−1, and predictions are calculated for the next year t. We use the three years t ∈ {1996, 1997, 1998} as test data. ... We use 10-fold cross-validation to analyze the prediction accuracy.
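The expanding-window temporal evaluation quoted above can be sketched in a few lines of plain Python. This is an illustration only, not code from the paper; the function name, the toy data, and the (year, value) record layout are our own assumptions.

```python
def expanding_window_splits(records, test_years=(1996, 1997, 1998)):
    """For each test year t, train on all records strictly before t
    and test on the records of year t (expanding training window)."""
    splits = []
    for t in test_years:
        train = [i for i, (year, _) in enumerate(records) if year < t]
        test = [i for i, (year, _) in enumerate(records) if year == t]
        splits.append((t, train, test))
    return splits

# Toy data: (year, value) pairs; one record per year for clarity.
data = [(1994, 1.0), (1995, 2.0), (1996, 3.0), (1997, 4.0), (1998, 5.0)]
for t, train_idx, test_idx in expanding_window_splits(data):
    # e.g. for t = 1996 the training window covers indices 0-1 (1994-1995)
    print(t, train_idx, test_idx)
```

Each successive test year inherits the previous year's test data into its training window, which is what makes the window "expanding" rather than rolling.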
Hardware Specification | Yes | All calculations are done on a laptop with a 2.9 GHz quad-core processor and 16 GB of random-access memory (RAM).
Software Dependencies | Yes | Learning and prediction with the GPBoost and GPBoost OOS algorithms... is done using the GPBoost library version 0.7.8 compiled with the MSVC compiler version 19.24.28315.0 and OpenMP version 2.0. ... For the mboost algorithm, we use the mboost R package (Hofner et al., 2014) version 2.9-2... For the MERF algorithm, we use the merf Python package version 0.33. ... For the REEMtree algorithm, we use the REEMtree R package version 0.90.3. ... For the CatBoost algorithm, we use the CatBoost library version 1.0.6.
Experiment Setup | Yes | For every boosting algorithm (LSBoost, mboost, CatBoost, GPBoost, and GPBoost OOS), we consider the following candidate tuning parameters: the number of boosting iterations M ∈ {1, ..., 1000}, the learning rate ν ∈ {0.1, 0.05, 0.01}, the maximal tree depth ∈ {1, 5, 10}, and the minimal number of samples per leaf ∈ {1, 10, 100}. For the MERF algorithm, we choose the proportion of variables considered for making splits ∈ {0.5, 0.75, 1}. As in Hajjem et al. (2014), we do not impose a maximal tree depth limit and set the number of trees to 300. For the REEMtree package, which relies on the rpart R package, trees are cost-complexity pruned and the amount of pruning is chosen using 10-fold cross-validation on the training data.
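The candidate tuning-parameter grid reported above can be enumerated as follows. This is a sketch of the grid only: the paper's quoted setup does not specify the search strategy here, and the dictionary keys are illustrative names, not confirmed parameter names of any of the listed libraries.

```python
from itertools import product

# Candidate values quoted in the experiment setup (learning rate,
# maximal tree depth, minimal number of samples per leaf).
learning_rates = [0.1, 0.05, 0.01]
max_depths = [1, 5, 10]
min_samples_leaf = [1, 10, 100]

# Exhaustive enumeration of all combinations; each combination would
# additionally be run for up to M = 1000 boosting iterations.
grid = [
    {"learning_rate": nu, "max_depth": d, "min_samples_leaf": m}
    for nu, d, m in product(learning_rates, max_depths, min_samples_leaf)
]
print(len(grid))  # 3 * 3 * 3 = 27 combinations
```

Treating the number of boosting iterations M as a separate dimension chosen on validation data (rather than a 27 × 1000 grid) is the standard practice for boosting, since all iteration counts up to 1000 are evaluated in a single training run.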