The Kernel Density Integral Transformation

Authors: Calvin McCarter

TMLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this paper, we propose the use of the kernel density integral transformation as a feature preprocessing step. Our approach subsumes the two leading feature preprocessing methods as limiting cases: linear min-max scaling and quantile transformation. We demonstrate that, without hyperparameter tuning, the kernel density integral transformation can be used as a simple drop-in replacement for either method, offering protection from the weaknesses of each. Alternatively, with tuning of a single continuous hyperparameter, we frequently outperform both of these methods. Finally, we show that the kernel density transformation can be profitably applied to statistical data analysis, particularly in correlation analysis and univariate clustering.
Researcher Affiliation | Industry | Calvin McCarter, EMAIL, Boston, MA
Pseudocode | No | The paper describes the approach in prose, outlining steps for correlation analysis and univariate clustering, but does not present them in a structured pseudocode or algorithm block.
Open Source Code | Yes | Our software package, implemented in Python/Numba with a Scikit-learn (Pedregosa et al., 2011) compatible API, is available at https://github.com/calvinmccarter/kditransform.
Open Datasets | Yes | We first replicate the experimental setup of (Raschka, 2014), analyzing the effect of feature preprocessing methods on a simple Naive Bayes classifier for the Wine dataset (Forina et al., 1988). In addition to min-max scaling, used in (Raschka, 2014), we also try quantile transformation and our proposed KD-integral transformation approach. [...] We repeat the above experimental setup for 3 more popular tabular classification datasets: Iris (Fisher, 1936), Penguins (Gorman et al., 2014), and Hawks (Cannon et al., 2019) [...]. We compare the different methods on two standard regression problems, California Housing (Pace & Barry, 1997) and Abalone (Nash et al., 1995).
Dataset Splits | Yes | Evaluated via a 70-30 train-test split. [...] For each preprocessing method, we optimized the regularization hyperparameter C ∈ {10^-4, 10^-3, ..., 10^2}, evaluating each method via one-vs-rest-weighted ROC AUC, averaged over 4 stratified cross-validation folds.
Hardware Specification | Yes | Runtime was measured on a machine with a 2.8 GHz Core i5 processor.
Software Dependencies | No | Our software package, implemented in Python/Numba with a Scikit-learn (Pedregosa et al., 2011) compatible API, is available at https://github.com/calvinmccarter/kditransform. While Python, Numba, and Scikit-learn are mentioned, specific version numbers are not provided.
Experiment Setup | Yes | For the KD-integral transformation, we show results for the default bandwidth factor of 1, for a bandwidth factor chosen via inner 30-fold cross-validation, and for a sweep of bandwidth factors between 0.1 and 10. [...] For each preprocessing method, we optimized the regularization hyperparameter C ∈ {10^-4, 10^-3, ..., 10^2}, evaluating each method via one-vs-rest-weighted ROC AUC, averaged over 4 stratified cross-validation folds.
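The core idea of the transformation, as described in the abstract, can be sketched in a few lines: each feature value is mapped through the CDF of a kernel density estimate fit on the training values, so that a very small bandwidth recovers the empirical CDF (quantile transformation) while a very large bandwidth approaches an affine rescaling (min-max scaling). The sketch below is illustrative only, not the `kditransform` package itself; the function name and the Silverman-style reference bandwidth are assumptions introduced here for clarity.

```python
import numpy as np
from scipy.stats import norm

def kd_integral_transform(x_train, x_eval, bandwidth_factor=1.0):
    """Illustrative sketch: map x_eval through the CDF of a Gaussian
    KDE fit on x_train.

    bandwidth_factor -> 0 approaches the empirical CDF (quantile
    transformation); bandwidth_factor -> infinity approaches an affine
    map of the input (min-max scaling up to rescaling).
    """
    x_train = np.asarray(x_train, dtype=float)
    x_eval = np.asarray(x_eval, dtype=float)
    # Silverman-style reference bandwidth, scaled by the single
    # tunable factor (an assumed rule, for illustration only).
    n = x_train.size
    h = bandwidth_factor * 1.06 * x_train.std() * n ** (-1 / 5)
    # The KDE's CDF is the average of the per-training-point
    # Gaussian CDFs, evaluated at each query point.
    z = (x_eval[:, None] - x_train[None, :]) / h
    return norm.cdf(z).mean(axis=1)
```

With a tiny bandwidth factor, the output for a query midway through the training sample sits at its empirical quantile (e.g. 0.5 for the median region), while the default factor yields a smoothed interpolation between the two limiting transforms.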