A Case for Library-Level $k$-Means Binning in Histogram Gradient-Boosted Trees

Authors: Asher Labovich

TMLR 2025

Reproducibility assessment: Variable | Result | LLM Response
Research Type | Experimental | We test this swap against quantile and uniform binning on 33 OpenML datasets plus synthetics that control for modality, skew, and bin budget. Across the 18 regression datasets, k-means shows no statistically significant losses at the 5% level and wins in three cases, most strikingly a 55% MSE drop on one particularly skewed dataset, even though the k-means mean reciprocal rank (MRR) is slightly lower (0.65 vs. 0.72). On the 15 classification datasets the two methods are statistically tied (MRR 0.70 vs. 0.68), with gaps ≤ 0.2 pp. Synthetic experiments confirm consistently large MSE gains, typically >20% and rising to 90% as outlier magnitude increases or the bin budget drops.
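The binning swap being evaluated can be previewed with scikit-learn's KBinsDiscretizer, which implements both strategies. This is an illustrative sketch on a synthetic skewed feature, not the paper's actual pipeline:

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

rng = np.random.default_rng(0)
# A heavily right-skewed feature (lognormal): the regime where the
# paper reports the largest k-means gains over quantile binning.
x = rng.lognormal(mean=0.0, sigma=2.0, size=(10_000, 1))

kmeans_binner = KBinsDiscretizer(
    n_bins=16, encode="ordinal", strategy="kmeans").fit(x)
quantile_binner = KBinsDiscretizer(
    n_bins=16, encode="ordinal", strategy="quantile").fit(x)

# The two strategies place bin edges very differently on skewed data:
# quantile bins are equal-mass, while k-means bins minimize within-bin
# squared distance to the bin center.
print("k-means edges: ", kmeans_binner.bin_edges_[0][:5])
print("quantile edges:", quantile_binner.bin_edges_[0][:5])
```

In a histogram gradient-boosted tree, these edges determine the only split points the learner can ever consider, which is why the binning choice matters at low bin budgets.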
Researcher Affiliation | Academia | Asher Labovich (EMAIL), Department of Applied Mathematics, Brown University
Pseudocode | Yes | Algorithm 1: MakeSynth(nobs, nfeat, nmodes, dist, pout, β)
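Only the signature of Algorithm 1 is quoted above; the body below is a hedged guess at what such a generator might look like, matching the review's description of synthetics that control modality, skew, and outliers. The interpretation of β as a coefficient scale, and every implementation detail, are assumptions, not the paper's actual algorithm:

```python
import numpy as np

def make_synth(nobs, nfeat, nmodes, dist="normal", pout=0.0,
               beta_scale=1.0, seed=0):
    """Sketch of a MakeSynth-style generator (assumed, not the paper's).

    Features are drawn from a mixture of `nmodes` modes, a `pout`
    fraction of rows is inflated into outliers, and the target is a
    noisy linear function with coefficients at scale `beta_scale`.
    """
    rng = np.random.default_rng(seed)
    centers = rng.uniform(-5.0, 5.0, size=(nmodes, nfeat))
    assign = rng.integers(nmodes, size=nobs)
    if dist == "normal":
        X = centers[assign] + rng.normal(size=(nobs, nfeat))
    else:
        # Heavier-tailed alternative to induce skew.
        X = centers[assign] + rng.standard_t(df=3, size=(nobs, nfeat))
    # Inject outliers by inflating a random subset of rows.
    mask = rng.random(nobs) < pout
    X[mask] *= 10.0
    beta = rng.normal(scale=beta_scale, size=nfeat)
    y = X @ beta + rng.normal(size=nobs)
    return X, y
```

Raising the outlier fraction or magnitude in a generator like this is what drives the large synthetic-experiment gaps the review reports.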
Open Source Code | Yes | An anonymized reproducibility package (source code, logs, and result tables) is provided for reviewers (see the Links section).
Open Datasets | Yes | To ensure replicable results, we evaluate our models on the OpenML (Vanschoren et al., 2014) benchmark suite described in Grinsztajn et al. (2022) (study_id 336 for regression, 337 for binary classification), dropping one task from each track (Higgs, Zurich Delays) due to computational constraints. The remaining 18 regression and 15 classification tasks span 10³–10⁶ instances and 2–420 numeric features. Appendix B lists observations and features for each dataset.
Dataset Splits | Yes | Each experiment is repeated over 20 random train/test splits (80/20) to estimate variability.
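The split protocol is simple enough to sketch directly; here a LinearRegression stand-in replaces the actual boosted learner, and all data is synthetic:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = X @ rng.normal(size=4) + rng.normal(scale=0.1, size=500)

# 20 repetitions of an 80/20 split, as in the paper's protocol.
mses = []
for seed in range(20):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, random_state=seed)
    model = LinearRegression().fit(X_tr, y_tr)
    mses.append(mean_squared_error(y_te, model.predict(X_te)))

# Mean and spread across splits estimate the variability the paper's
# significance tests are based on.
print(f"MSE: {np.mean(mses):.4f} +/- {np.std(mses):.4f}")
```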
Hardware Specification | Yes | All real-world experiments were conducted on an academic Slurm cluster, on 48-core Intel Xeon Platinum 8268 CPUs @ 2.90 GHz. The experiments ran in 120 h wall-clock, performing 1,792 core-hours of work and peaking at 7.7 GB of memory. Additional exploratory runs on the same cluster amounted to no more than 10k CPU-hours (≈6× the final sweep), as confirmed from Slurm accounting over the project period. All synthetic experiments were executed on a MacBook Pro (Apple M1 Pro, 8 threads, 16 GB RAM, macOS 15.0).
Software Dependencies | No | The paper mentions scikit-learn's vanilla GradientBoostingRegressor/Classifier and scipy.stats.skew, and cites "SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python" (Virtanen et al., 2020), but it does not provide version numbers for the key software components used in the experiments. For example, it does not state "scikit-learn X.Y.Z".
Experiment Setup | Yes | For every (dataset, binning, learner) triple we run a 30-trial RandomizedSearchCV with 5-fold CV over the hyperparameters shown in Appendix C. Each experiment is repeated over 20 random train/test splits (80/20) to estimate variability. We keep the scikit-learn defaults (100 trees, learning rate = 0.01) for consistency, and simply raise the depth to 5 and set subsample = 0.8 to give the model adequate capacity and standard stochastic regularization without masking the binning effects.
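The search scaffolding implied by this setup might look like the following sketch. The fixed estimator settings are taken from the quoted description; the hyperparameter distributions are placeholders, since the real search space is in the paper's Appendix C:

```python
from scipy.stats import randint, uniform
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV

search = RandomizedSearchCV(
    GradientBoostingRegressor(
        n_estimators=100,    # 100 trees, per the quoted setup
        learning_rate=0.01,  # learning rate, per the quoted setup
        max_depth=5,         # depth raised to 5
        subsample=0.8,       # stochastic regularization
        random_state=0,
    ),
    # Placeholder distributions; the actual grid is in Appendix C.
    param_distributions={
        "min_samples_leaf": randint(1, 50),
        "max_features": uniform(0.3, 0.7),
    },
    n_iter=30,  # 30-trial randomized search
    cv=5,       # 5-fold cross-validation
    random_state=0,
)

# Tiny synthetic demo fit; the paper runs this per (dataset, binning,
# learner) triple over 20 outer train/test splits.
X, y = make_regression(n_samples=80, n_features=4, random_state=0)
search.fit(X, y)
```

Keeping the tree count, learning rate, depth, and subsampling fixed while searching only the remaining hyperparameters is what isolates the binning effect the paper measures.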