Boosting Revisited: Benchmarking and Advancing LP-Based Ensemble Methods
Authors: Fabian Akkerman, Julien Ferry, Christian Artigues, Emmanuel Hébrard, Thibaut Vidal
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we conduct the first large-scale experimental study of six LP-based boosting formulations, including two novel methods, NM-Boost and QRLP-Boost, across 20 diverse datasets. We evaluate the use of both heuristic and optimal base learners within these formulations, and analyze not only accuracy, but also ensemble sparsity, margin distribution, anytime performance, and hyperparameter sensitivity. |
| Researcher Affiliation | Academia | Fabian Akkerman EMAIL Industrial Engineering and Management Science University of Twente Julien Ferry EMAIL CIRRELT & SCALE-AI Chair in Data-Driven Supply Chains Polytechnique Montréal Christian Artigues EMAIL LAAS-CNRS Université de Toulouse Emmanuel Hébrard EMAIL LAAS-CNRS Université de Toulouse Thibaut Vidal EMAIL CIRRELT & SCALE-AI Chair in Data-Driven Supply Chains Polytechnique Montréal |
| Pseudocode | Yes | Algorithm 1 High-level totally corrective boosting algorithm |
| Open Source Code | Yes | The full source code is available under an MIT license at https://github.com/frakkerman/colboost. All methods are implemented in the unified, user-friendly colboost Python library to promote more systematic empirical comparisons between totally corrective methods and facilitate future research in this field, see https://pypi.org/project/colboost/. |
| Open Datasets | Yes | Furthermore, all experimental results and datasets used in this study can be found at https://doi.org/10.4121/f82dcdaa-fc94-43c5-b66d-02579bd3de4f. To obtain robust and generalizable insights into boosting with column generation methods, we conduct an extensive empirical study across 20 diverse datasets from widely used benchmark repositories in machine learning. These datasets were selected to ensure diversity in terms of size, feature dimensionality, and class balance. A complete description of the datasets is given in Appendix A. We study 13 datasets commonly used in related works (Bi et al., 2004; Rätsch et al., 2007; Warmuth et al., 2008; Shen & Li, 2010; Mitsuboshi et al., 2022) which were proposed as binary classification benchmarks by Rätsch et al. (2001) and originally sourced from UCI (Kelly et al., 2025), DELVE (Rasmussen et al., 1996) and STATLOG (Feng et al., 1993). We retrieved some datasets from the KEEL repository (Alcalá-Fdez et al., 2009; 2011). |
| Dataset Splits | Yes | Each dataset is split into 60% for training, 20% for validation, and 20% for testing. We select the best-performing hyperparameter value in terms of accuracy using the validation data, and report only the results on the testing data. For all results, we report averages over five random seeds, accounting for variability in data splits and method-specific randomness. |
| Hardware Specification | Yes | We run individual experiments on a single thread of a compute node equipped with AMD Genoa 9654 cores @ 2.4 GHz along with 2 GB of memory per thread. |
| Software Dependencies | Yes | All code is written in Python and experiments are executed on a Linux cluster using Python 3.11. Mathematical programs are solved using Gurobi 10.0.1. We rely on the scikit-learn library (Pedregosa et al., 2011) to build CART trees and fit AdaBoost ensembles, and use the Blossom library for optimal decision trees (Demirović et al., 2023). XGBoost and LightGBM ensembles are fitted using their respective Python libraries, see Chen & Guestrin (2016) and Ke et al. (2017), respectively. |
| Experiment Setup | Yes | To ensure a fair comparison between the different methods, we conduct hyperparameter tuning, varying the trade-off parameter C (totally corrective methods) and the learning rate (heuristic boosting methods) across 10 possible values. Appendix C details the hyperparameter ranges considered for each method, consistent with the original papers' recommendations and refined based on our preliminary experiments. As each method involves tuning a single hyperparameter only, an exhaustive sweep over a fixed set of values is straightforward and ensures that we reliably measure performance across the relevant range. We impose a time limit of 45 minutes and an iteration limit of 100 for all the methods. Early stopping is explicitly disabled for totally corrective approaches to ensure that all boosting strategies are evaluated under a comparable number of iterations. |
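The Pseudocode row above refers to Algorithm 1, a high-level totally corrective boosting algorithm; the paper's own formulations (NM-Boost, QRLP-Boost, etc.) are not reproduced here. As a rough illustration of the general column-generation scheme such algorithms share, below is a minimal LPBoost-style sketch in the spirit of Demiriz et al. (2002). Everything in it is our own illustrative choice, not the authors' implementation: `scipy.optimize.linprog` stands in for Gurobi, depth-1 scikit-learn trees stand in for the paper's base learners, and labels are assumed to be in {-1, +1}.

```python
import numpy as np
from scipy.optimize import linprog
from sklearn.tree import DecisionTreeClassifier

def lpboost(X, y, D=0.2, max_iters=30, tol=1e-6):
    """Column-generation boosting sketch (LPBoost-style):
    alternately solve the restricted dual LP over the current base
    learners, then add the base learner with the largest edge under
    the dual sample weights u. Labels y must be in {-1, +1}."""
    n = len(y)
    u = np.full(n, 1.0 / n)               # dual sample weights
    stumps, rows = [], []
    beta, res = 0.0, None
    for _ in range(max_iters):
        # A depth-1 tree fit with sample_weight u (approximately)
        # maximizes the edge sum_i u_i y_i h(x_i).
        h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=u)
        pred = h.predict(X)
        edge = np.dot(u, y * pred)
        if stumps and edge <= beta + tol:
            break                          # no improving column: stop
        stumps.append(h)
        rows.append(y * pred)              # dual constraint row for h
        # Restricted dual LP:
        #   min beta  s.t.  sum_i u_i y_i h_t(x_i) <= beta  for all t,
        #                   sum_i u_i = 1,  0 <= u_i <= D.
        A = np.asarray(rows)
        T = len(rows)
        res = linprog(
            c=np.r_[np.zeros(n), 1.0],     # variables (u, beta)
            A_ub=np.c_[A, -np.ones(T)], b_ub=np.zeros(T),
            A_eq=np.r_[np.ones(n), 0.0].reshape(1, -1), b_eq=[1.0],
            bounds=[(0, D)] * n + [(None, None)], method="highs")
        u, beta = res.x[:n], res.x[n]
    # Primal ensemble weights are the duals of the <= constraints.
    a = -res.ineqlin.marginals
    return stumps, a

def predict(stumps, a, X):
    votes = sum(w * h.predict(X) for h, w in zip(stumps, a))
    return np.where(votes >= 0, 1, -1)

# Toy run on a linearly separable 2-D problem:
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)
stumps, a = lpboost(X, y)
train_acc = (predict(stumps, a, X) == y).mean()
```

The sparsity behaviour studied in the paper shows up directly here: the LP typically assigns zero weight to most columns, so only a few stumps carry the ensemble.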
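The Dataset Splits row describes a 60/20/20 partition averaged over five seeds. A minimal sketch of that protocol with scikit-learn (our own reconstruction, not the authors' code; whether they stratify or shuffle identically is not stated here):

```python
import numpy as np
from sklearn.model_selection import train_test_split

def split_60_20_20(X, y, seed):
    # First hold out 40%, then split the holdout evenly into
    # validation (20%) and test (20%).
    X_tr, X_hold, y_tr, y_hold = train_test_split(
        X, y, test_size=0.4, random_state=seed)
    X_val, X_te, y_val, y_te = train_test_split(
        X_hold, y_hold, test_size=0.5, random_state=seed)
    return (X_tr, y_tr), (X_val, y_val), (X_te, y_te)

# Five seeds, as in the protocol above; each seed changes the split.
X = np.random.randn(200, 5)
y = np.random.randint(0, 2, size=200)
sizes = [len(split_60_20_20(X, y, s)[0][0]) for s in range(5)]
```

Reported metrics would then be averaged over the five per-seed test results.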
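The Experiment Setup row describes an exhaustive sweep over 10 values of a single hyperparameter, selected on validation accuracy. The loop below sketches this for the heuristic-boosting case using AdaBoost's learning rate; the grid, data, and split here are placeholders of our own, not the ranges from Appendix C:

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = (X.sum(axis=1) > 0).astype(int)

# 60/20/20 split as in the protocol above.
X_tr, X_rest, y_tr, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_te, y_val, y_te = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

grid = np.logspace(-2, 0, 10)   # 10 candidate learning rates (illustrative)
best_lr, best_acc = None, -1.0
for lr in grid:
    clf = AdaBoostClassifier(n_estimators=100, learning_rate=lr,
                             random_state=0).fit(X_tr, y_tr)
    acc = clf.score(X_val, y_val)
    if acc > best_acc:
        best_lr, best_acc = lr, acc

# Report only the test accuracy of the validation-selected model.
final = AdaBoostClassifier(n_estimators=100, learning_rate=best_lr,
                           random_state=0).fit(X_tr, y_tr)
test_acc = final.score(X_te, y_te)
```

For the totally corrective methods, the same loop would sweep the trade-off parameter C instead, with the 45-minute time limit and 100-iteration cap applied per fit.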