The Kernel Density Integral Transformation

Authors: Calvin McCarter

TMLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this paper, we propose the use of the kernel density integral transformation as a feature preprocessing step. Our approach subsumes the two leading feature preprocessing methods as limiting cases: linear min-max scaling and quantile transformation. We demonstrate that, without hyperparameter tuning, the kernel density integral transformation can be used as a simple drop-in replacement for either method, offering protection from the weaknesses of each. Alternatively, with tuning of a single continuous hyperparameter, we frequently outperform both of these methods. Finally, we show that the kernel density transformation can be profitably applied to statistical data analysis, particularly in correlation analysis and univariate clustering.
Researcher Affiliation | Industry | Calvin McCarter, EMAIL, Boston, MA
Pseudocode | No | The paper describes the approach in prose, outlining steps for correlation analysis and univariate clustering, but does not present them in a structured pseudocode or algorithm block.
Open Source Code | Yes | Our software package, implemented in Python/Numba with a Scikit-learn (Pedregosa et al., 2011) compatible API, is available at https://github.com/calvinmccarter/kditransform.
Open Datasets | Yes | We first replicate the experimental setup of (Raschka, 2014), analyzing the effect of feature preprocessing methods on a simple Naive Bayes classifier for the Wine dataset (Forina et al., 1988). In addition to min-max scaling, used in (Raschka, 2014), we also try quantile transformation and our proposed KD-integral transformation approach. [...] We repeat the above experimental setup for 3 more popular tabular classification datasets: Iris (Fisher, 1936), Penguins (Gorman et al., 2014), and Hawks (Cannon et al., 2019) [...]. We compare the different methods on two standard regression problems, California Housing (Pace & Barry, 1997) and Abalone (Nash et al., 1995).
Dataset Splits | Yes | Evaluated via a 70-30 train-test split. [...] For each preprocessing method, we optimized the regularization hyperparameter C ∈ {10^-4, 10^-3, ..., 10^2}, evaluating each method via one-vs-rest-weighted ROC AUC, averaged over 4 stratified cross-validation folds.
Hardware Specification | Yes | Runtime was measured on a machine with a 2.8 GHz Core i5 processor.
Software Dependencies | No | Our software package, implemented in Python/Numba with a Scikit-learn (Pedregosa et al., 2011) compatible API, is available at https://github.com/calvinmccarter/kditransform. While Python, Numba, and Scikit-learn are mentioned, specific version numbers are not provided.
Experiment Setup | Yes | For the KD-integral transformation, we show results for the default bandwidth factor of 1, for a bandwidth factor chosen via inner 30-fold cross-validation, and for a sweep of bandwidth factors between 0.1 and 10. [...] For each preprocessing method, we optimized the regularization hyperparameter C ∈ {10^-4, 10^-3, ..., 10^2}, evaluating each method via one-vs-rest-weighted ROC AUC, averaged over 4 stratified cross-validation folds.
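The core idea of the transformation, as described in the abstract, can be sketched in a few lines: each feature value is mapped through the CDF of a kernel density estimate fit on the training values, so that a very small bandwidth recovers the empirical CDF (quantile transformation) while a very large bandwidth approaches an affine rescaling (min-max scaling). The sketch below is illustrative only, not the `kditransform` package itself; the function name and the Silverman-style reference bandwidth are assumptions introduced here for clarity.

```python
import numpy as np
from scipy.stats import norm

def kd_integral_transform(x_train, x_eval, bandwidth_factor=1.0):
    """Illustrative sketch: map x_eval through the CDF of a Gaussian
    KDE fit on x_train.

    bandwidth_factor -> 0 approaches the empirical CDF (quantile
    transformation); bandwidth_factor -> infinity approaches an affine
    map of the input (min-max scaling up to rescaling).
    """
    x_train = np.asarray(x_train, dtype=float)
    x_eval = np.asarray(x_eval, dtype=float)
    # Silverman-style reference bandwidth, scaled by the single
    # tunable factor (an assumed rule, for illustration only).
    n = x_train.size
    h = bandwidth_factor * 1.06 * x_train.std() * n ** (-1 / 5)
    # The KDE's CDF is the average of the per-training-point
    # Gaussian CDFs, evaluated at each query point.
    z = (x_eval[:, None] - x_train[None, :]) / h
    return norm.cdf(z).mean(axis=1)
```

With a tiny bandwidth factor, the output for a query midway through the training sample sits at its empirical quantile (e.g. 0.5 for the median region), while the default factor yields a smoothed interpolation between the two limiting transforms.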