A Case for Library-Level $k$-Means Binning in Histogram Gradient-Boosted Trees

Authors: Asher Labovich

TMLR 2025

Reproducibility assessment: Variable | Result | LLM Response
Research Type | Experimental | We test this swap against quantile and uniform binning on 33 OpenML datasets plus synthetics that control for modality, skew, and bin budget. Across the 18 regression datasets, k-means shows no statistically significant losses at the 5% level and wins in three cases, most strikingly a 55% MSE drop on one particularly skewed dataset, even though the k-means mean reciprocal rank (MRR) is slightly lower (0.65 vs. 0.72). On the 15 classification datasets the two methods are statistically tied (MRR 0.70 vs. 0.68), with gaps ≤ 0.2 pp. Synthetic experiments confirm consistently large MSE gains, typically >20% and rising to 90% as outlier magnitude increases or the bin budget drops.
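The binning swap being evaluated can be previewed with scikit-learn's KBinsDiscretizer, which implements both strategies. This is an illustrative sketch on a synthetic skewed feature, not the paper's actual pipeline:

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

rng = np.random.default_rng(0)
# A heavily right-skewed feature (lognormal): the regime where the
# paper reports the largest k-means gains over quantile binning.
x = rng.lognormal(mean=0.0, sigma=2.0, size=(10_000, 1))

kmeans_binner = KBinsDiscretizer(
    n_bins=16, encode="ordinal", strategy="kmeans").fit(x)
quantile_binner = KBinsDiscretizer(
    n_bins=16, encode="ordinal", strategy="quantile").fit(x)

# The two strategies place bin edges very differently on skewed data:
# quantile bins are equal-mass, while k-means bins minimize within-bin
# squared distance to the bin center.
print("k-means edges: ", kmeans_binner.bin_edges_[0][:5])
print("quantile edges:", quantile_binner.bin_edges_[0][:5])
```

In a histogram gradient-boosted tree, these edges determine the only split points the learner can ever consider, which is why the binning choice matters at low bin budgets.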
Researcher Affiliation | Academia | Asher Labovich (EMAIL), Department of Applied Mathematics, Brown University
Pseudocode | Yes | Algorithm 1: MakeSynth(nobs, nfeat, nmodes, dist, pout, β)
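Only the signature of Algorithm 1 is quoted above; the body below is a hedged guess at what such a generator might look like, matching the review's description of synthetics that control modality, skew, and outliers. The interpretation of β as a coefficient scale, and every implementation detail, are assumptions, not the paper's actual algorithm:

```python
import numpy as np

def make_synth(nobs, nfeat, nmodes, dist="normal", pout=0.0,
               beta_scale=1.0, seed=0):
    """Sketch of a MakeSynth-style generator (assumed, not the paper's).

    Features are drawn from a mixture of `nmodes` modes, a `pout`
    fraction of rows is inflated into outliers, and the target is a
    noisy linear function with coefficients at scale `beta_scale`.
    """
    rng = np.random.default_rng(seed)
    centers = rng.uniform(-5.0, 5.0, size=(nmodes, nfeat))
    assign = rng.integers(nmodes, size=nobs)
    if dist == "normal":
        X = centers[assign] + rng.normal(size=(nobs, nfeat))
    else:
        # Heavier-tailed alternative to induce skew.
        X = centers[assign] + rng.standard_t(df=3, size=(nobs, nfeat))
    # Inject outliers by inflating a random subset of rows.
    mask = rng.random(nobs) < pout
    X[mask] *= 10.0
    beta = rng.normal(scale=beta_scale, size=nfeat)
    y = X @ beta + rng.normal(size=nobs)
    return X, y
```

Raising the outlier fraction or magnitude in a generator like this is what drives the large synthetic-experiment gaps the review reports.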
Open Source Code | Yes | An anonymized reproducibility package (source code, logs, and result tables) is provided for reviewers (see the Links section).
Open Datasets | Yes | To ensure replicable results, we evaluate our models on the OpenML (Vanschoren et al., 2014) benchmark suite described in Grinsztajn et al. (2022) (study_id 336 for regression, 337 for binary classification), dropping one task from each track (Higgs, Zurich Delays) due to computational constraints. The remaining 18 regression and 15 classification tasks span 10³–10⁶ instances and 2–420 numeric features. Appendix B lists observations and features for each dataset.
Dataset Splits | Yes | Each experiment is repeated over 20 random train/test splits (80/20) to estimate variability.
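The split protocol is simple enough to sketch directly; here a LinearRegression stand-in replaces the actual boosted learner, and all data is synthetic:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = X @ rng.normal(size=4) + rng.normal(scale=0.1, size=500)

# 20 repetitions of an 80/20 split, as in the paper's protocol.
mses = []
for seed in range(20):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, random_state=seed)
    model = LinearRegression().fit(X_tr, y_tr)
    mses.append(mean_squared_error(y_te, model.predict(X_te)))

# Mean and spread across splits estimate the variability the paper's
# significance tests are based on.
print(f"MSE: {np.mean(mses):.4f} +/- {np.std(mses):.4f}")
```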
Hardware Specification | Yes | All real-world experiments were conducted on an academic Slurm cluster, on 48-core Intel Xeon Platinum 8268 CPUs @ 2.90 GHz. The experiments ran in 120 h wall-clock, performing 1,792 core-hours of work and peaking at 7.7 GB of memory. Additional exploratory runs on the same cluster amounted to no more than 10k CPU-hours (≈6× the final sweep), as confirmed from Slurm accounting over the project period. All synthetic experiments were executed on a MacBook Pro (Apple M1 Pro, 8 threads, 16 GB RAM, macOS 15.0).
Software Dependencies | No | The paper mentions scikit-learn's vanilla GradientBoostingRegressor/Classifier and scipy.stats.skew, and cites "SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python" (Virtanen et al., 2020), but it does not provide version numbers for the key software components used in the experiments. For example, it does not state "scikit-learn X.Y.Z".
Experiment Setup | Yes | For every (dataset, binning, learner) triple we run a 30-trial RandomizedSearchCV with 5-fold CV over the hyperparameters shown in Appendix C. Each experiment is repeated over 20 random train/test splits (80/20) to estimate variability. We keep the scikit-learn defaults (100 trees, learning rate = 0.01) for consistency, and simply raise the depth to 5 and set subsample = 0.8 to give the model adequate capacity and standard stochastic regularization without masking the binning effects.
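The search scaffolding implied by this setup might look like the following sketch. The fixed estimator settings are taken from the quoted description; the hyperparameter distributions are placeholders, since the real search space is in the paper's Appendix C:

```python
from scipy.stats import randint, uniform
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV

search = RandomizedSearchCV(
    GradientBoostingRegressor(
        n_estimators=100,    # 100 trees, per the quoted setup
        learning_rate=0.01,  # learning rate, per the quoted setup
        max_depth=5,         # depth raised to 5
        subsample=0.8,       # stochastic regularization
        random_state=0,
    ),
    # Placeholder distributions; the actual grid is in Appendix C.
    param_distributions={
        "min_samples_leaf": randint(1, 50),
        "max_features": uniform(0.3, 0.7),
    },
    n_iter=30,  # 30-trial randomized search
    cv=5,       # 5-fold cross-validation
    random_state=0,
)

# Tiny synthetic demo fit; the paper runs this per (dataset, binning,
# learner) triple over 20 outer train/test splits.
X, y = make_regression(n_samples=80, n_features=4, random_state=0)
search.fit(X, y)
```

Keeping the tree count, learning rate, depth, and subsampling fixed while searching only the remaining hyperparameters is what isolates the binning effect the paper measures.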