NRGBoost: Energy-Based Generative Boosted Trees
Authors: João Bravo
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We explore generative extensions of these popular algorithms with a focus on explicitly modeling the data density (up to a normalization constant), thus enabling other applications besides sampling. As our main contribution we propose an energy-based generative boosting algorithm that is analogous to the second-order boosting implemented in popular libraries like XGBoost. We show that, despite producing a generative model capable of handling inference tasks over any input variable, our proposed algorithm can achieve similar discriminative performance to GBDT on a number of real world tabular datasets, outperforming alternative generative approaches. At the same time, we show that it is also competitive with neural-network-based models for sampling. |
| Researcher Affiliation | Industry | João Bravo Feedzai EMAIL |
| Pseudocode | Yes | In Algorithm 1 we provide a high-level overview of the training loop for NRGBoost. |
| Open Source Code | Yes | Code is available at https://github.com/ajoo/nrgboost. |
| Open Datasets | Yes | We evaluate NRGBoost on five tabular datasets from the UCI Machine Learning Repository (Dheeru & Karra Taniskidou, 2017): Abalone (AB), Physicochemical Properties of Protein Tertiary Structure (PR), Adult (AD), MiniBooNE (MBNE) and Covertype (CT) as well as the California Housing (CH) dataset available through scikit-learn (Pedregosa et al., 2011). We also include a downsampled version of MNIST (by 2x along each dimension), which allows us to visually assess the quality of individual samples, something that is generally difficult with structured tabular data. |
| Dataset Splits | Yes | Table 5: Dataset Information. We respect the original test sets of each dataset when provided, otherwise we set aside 20% of the original dataset as a test set. 20% of the remaining data is set aside as a validation set used for hyperparameter tuning. ... For the single-variable inference evaluation, the best models are selected by their discriminative performance on a validation set. The entire setup is repeated five times with different cross-validation folds and with different seeds for all sources of randomness. For the Adult and MNIST datasets the test set is fixed but training and validation splits are still rotated. |
| Hardware Specification | Yes | The experiments were run on a Linux machine equipped with an AMD Ryzen 7 7700X 8 core CPU and 32 GB of RAM. The comparisons with TVAE and TabDDPM additionally made use of a GeForce RTX 3060 GPU with 12 GB of VRAM. |
| Software Dependencies | No | Our implementation of the proposed tree-based methods is mostly Python code using the NumPy library (Harris et al., 2020) and Numba. We implement the tree evaluation and Gibbs sampling in C, making use of the PCG library (O'Neill, 2014) for random number generation. |
| Experiment Setup | Yes | We use random search to tune the hyperparameters of XGBoost and NGBoost and a grid search to tune the most important hyperparameters of each generative density model. We employ 5-fold cross-validation, repeating the hyperparameter tuning on each fold. For the full details of the experimental protocol please refer to Appendix D. ... Appendix D.3 Hyperparameter Tuning |
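The split protocol quoted under "Dataset Splits" (hold out 20% as a test set when no official split exists, then set aside 20% of the remainder for validation) can be sketched as follows. This is an illustrative reconstruction, not the authors' code; the function name and seed are hypothetical.

```python
import numpy as np

def make_splits(n_rows, seed=0):
    """Return (train, val, test) index arrays: 20% test, then 20% of the rest as validation."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_rows)
    n_test = int(0.2 * n_rows)        # 20% of the full dataset held out as test
    test, rest = idx[:n_test], idx[n_test:]
    n_val = int(0.2 * len(rest))      # 20% of the remaining data as validation
    val, train = rest[:n_val], rest[n_val:]
    return train, val, test

train, val, test = make_splits(1000)
# 1000 rows -> 640 train / 160 validation / 200 test
```

For Adult and MNIST, where the paper fixes the test set, only the train/validation step would be re-drawn across repetitions.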
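The tuning procedure quoted under "Experiment Setup" (random search over hyperparameters, scored by cross-validation, repeated per fold) can be sketched generically. The search space below is a placeholder; the paper's actual ranges are given in its Appendix D.3.

```python
import random

# Illustrative XGBoost-style search space (NOT the paper's actual ranges).
SPACE = {
    "max_depth": lambda: random.randint(3, 10),
    "learning_rate": lambda: 10 ** random.uniform(-3, 0),
    "subsample": lambda: random.uniform(0.5, 1.0),
}

def sample_config():
    """Draw one random hyperparameter configuration from SPACE."""
    return {name: draw() for name, draw in SPACE.items()}

def random_search(evaluate, n_trials=50, seed=0):
    """Keep the best of n_trials random configs.

    `evaluate` should return a score to maximize, e.g. a mean
    5-fold cross-validation score for the candidate config.
    """
    random.seed(seed)
    best_cfg, best_score = None, float("-inf")
    for _ in range(n_trials):
        cfg = sample_config()
        score = evaluate(cfg)
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score
```

In the paper's protocol this whole search is re-run on each of the five cross-validation folds, so each fold gets its own tuned configuration.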