Gaussian Process Boosting
Authors: Fabio Sigrist
JMLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We obtain increased prediction accuracy compared to existing approaches on multiple simulated and real-world data sets. Keywords: non-linear mixed effects models, mixed effects machine learning, grouped random effects, longitudinal data, spatial and spatio-temporal data, tree-boosting with high-cardinality categorical variables |
| Researcher Affiliation | Academia | Fabio Sigrist EMAIL Lucerne University of Applied Sciences and Arts, Suurstoffi 1, 6343 Rotkreuz, Switzerland |
| Pseudocode | Yes | Algorithm 1: GPBoost: Gaussian Process Boosting... Algorithm 2: GPBoost OOS: Gaussian Process Boosting with Out-Of-Sample covariance parameter estimation |
| Open Source Code | Yes | The GPBoost algorithm is implemented in the GPBoost library written in C++ with a C application programming interface (API) and corresponding Python and R packages. See https://github.com/fabsig/GPBoost for more information. |
| Open Datasets | Yes | We use panel data from the National Longitudinal Survey of Young Working Women... it can be downloaded from https://www.stata-press.com/data/r10/nlswork.dta. ... This data is available in the sp Data R package (Bivand et al., 2008)... We compare the GPBoost algorithm to independent boosting and Gaussian process regression using several UCI data set repository benchmark data sets (http://archive.ics.uci.edu/ml/index.php). |
| Dataset Splits | Yes | We simulate 100 times both a training data set of size n and two different test data sets each also of size n. ... We compare the prediction accuracy of different approaches using nested 4-fold cross-validation. Specifically, all observations are partitioned into four disjoint sets, and, in every fold, one of the sets is used as test data and the remaining data is used for training. ... Prediction accuracy is evaluated by partitioning the data into expanding window training data sets and temporal out-of-sample test data sets. Specifically, learning is done on an expanding window containing all data up to the year t − 1, and predictions are calculated for the next year t. We use the three years t ∈ {1996, 1997, 1998} as test data. ... We use 10-fold cross-validation to analyze the prediction accuracy. |
| Hardware Specification | Yes | All calculations are done on a laptop with a 2.9 GHz quad-core processor and 16 GB of random-access memory (RAM). |
| Software Dependencies | Yes | Learning and prediction with the GPBoost and GPBoost OOS algorithms... is done using the GPBoost library version 0.7.8 compiled with the MSVC compiler version 19.24.28315.0 and OpenMP version 2.0. ... For the mboost algorithm, we use the mboost R package (Hofner et al., 2014) version 2.9-2... For the MERF algorithm, we use the merf Python package version 0.33. ... For the REEMtree algorithm, we use the REEMtree R package version 0.90.3. ... For the CatBoost algorithm, we use the CatBoost library version 1.0.6. |
| Experiment Setup | Yes | For every boosting algorithm (LSBoost, mboost, CatBoost, GPBoost, and GPBoost OOS), we consider the following candidate tuning parameters: the number of boosting iterations M ∈ {1, . . . , 1000}, the learning rate ν ∈ {0.1, 0.05, 0.01}, the maximal tree depth ∈ {1, 5, 10}, and the minimal number of samples per leaf ∈ {1, 10, 100}. For the MERF algorithm, we choose the proportion of variables considered for making splits ∈ {0.5, 0.75, 1}. As in Hajjem et al. (2014), we do not impose a maximal tree depth limit and set the number of trees to 300. For the REEMtree package, which relies on the rpart R package, trees are cost-complexity pruned and the amount of pruning is chosen using 10-fold cross-validation on the training data. |
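The expanding-window evaluation quoted in the Dataset Splits row (train on all records up to year t − 1, predict year t, for t ∈ {1996, 1997, 1998}) can be sketched in plain Python. This is a hedged illustration of the splitting scheme only, not the paper's code; the record layout is hypothetical.

```python
def expanding_window_splits(records, test_years):
    """Yield (test_year, train, test) triples for an expanding-window scheme.

    Each record is a (year, payload) tuple. For each test year t, the
    training set contains all records with year <= t - 1, and the test set
    contains the records from year t itself.
    """
    for t in sorted(test_years):
        train = [r for r in records if r[0] <= t - 1]
        test = [r for r in records if r[0] == t]
        yield t, train, test


# Toy data: one observation per line, years chosen to cover the splits.
records = [(y, f"obs-{i}") for i, y in enumerate(
    [1993, 1994, 1995, 1996, 1996, 1997, 1998])]

for t, train, test in expanding_window_splits(records, {1996, 1997, 1998}):
    print(t, len(train), len(test))
# 1996 3 2
# 1997 5 1
# 1998 6 1
```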
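The nested 4-fold cross-validation quoted in the Dataset Splits row partitions all observations into four disjoint sets, using each set once as test data and the rest for training. A minimal sketch of that partitioning (the shuffling seed and index layout are assumptions for illustration):

```python
import random


def k_fold_indices(n, k=4, seed=0):
    """Partition indices 0..n-1 into k disjoint folds.

    Returns a list of (train_indices, test_indices) pairs, one per fold;
    each index appears in exactly one test set.
    """
    idx = list(range(n))
    random.Random(seed).shuffle(idx)  # fixed seed so the split is reproducible
    folds = [idx[i::k] for i in range(k)]
    return [(sorted(set(idx) - set(fold)), sorted(fold)) for fold in folds]


splits = k_fold_indices(12, k=4)
print([len(test) for _, test in splits])  # [3, 3, 3, 3]
```

In the nested variant described in the paper, tuning parameters are chosen by an inner cross-validation run on each fold's training portion before evaluating on its test set.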