Bags of Projected Nearest Neighbours: Competitors to Random Forests?

Authors: David P. Hofmeyr

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experimental results are presented documenting the strong performance of the proposed approach in comparison with Random Forest classifiers, as well as other nearest neighbours based ensembles from the literature, plus other relevant benchmarks. Code to implement the proposed approach is available in the form of an R package from https://github.com/DavidHofmeyr/BOPNN. In Section 4 we document the results from experiments using all 162 data sets in the Penn Machine Learning Benchmarks repository (Olson et al., 2017, PMLB).
Researcher Affiliation | Academia | David P. Hofmeyr, EMAIL, School of Mathematical Sciences, Lancaster University, United Kingdom
Pseudocode | Yes | 3.2 A Full Model. Here we give a brief overview of our approach, which follows a standard format for the bagging framework. With settings for k ∈ ℕ, the number of neighbours; q0 ∈ [d], the number of randomly selected covariates used in a model; and q ∈ [q0], the dimension of each discriminant subspace, do the following for each b ∈ [B]:
1. Draw D_b = {(x_{1b}, y_{1b}), ..., (x_{n_B b}, y_{n_B b})} without replacement from D.
2. Draw Q_b of size q0 without replacement from [d].
3. For each i ∈ [n_B] find x̄_{ib}^k and x̄'_{ib}^k, the k-th in- and out-of-class neighbours of I_{Q_b} x_{ib} from among {I_{Q_b} x_{jb}}_{j ∈ [n_B], j ≠ i}.
4. Compute the discriminant matrix
Σ̂_b = ( Σ_{i=1}^{n_B} I_{Q_b} (x_{ib} − x̄_{ib}^k)(x_{ib} − x̄_{ib}^k)^⊤ I_{Q_b} )^{-1} ( Σ_{i=1}^{n_B} I_{Q_b} (x_{ib} − x̄'_{ib}^k)(x_{ib} − x̄'_{ib}^k)^⊤ I_{Q_b} )
and let Σ̂_b = U_b Λ_b U_b^{-1} be its eigen-decomposition.
5. Form g(x | D_b) based on the class labels of the k nearest neighbours of U_{b,1:q}^⊤ x from {U_{b,1:q}^⊤ x_{1b}, ..., U_{b,1:q}^⊤ x_{n_B b}}, either by taking the proportions in each class or the indicator vector for the most frequent class. Here we have used U_{b,1:q} ∈ R^{d×q} to denote the first q columns of U_b.
Then obtain the final prediction by taking the mode of (1/B) Σ_{b=1}^B g(x | D_b).
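The five steps quoted above can be sketched in NumPy. This is a hypothetical re-implementation, not the authors' code (their reference implementation is the R package BOPNN); in particular the ridge term, the eigenvalue ordering, and the handling of small classes are assumptions made for the sketch.

```python
import numpy as np

def fit_predict_bopnn(X, y, X_test, B=25, k=3, q0=None, q=None, pi_B=0.63, seed=None):
    """Illustrative sketch of a bag of projected nearest-neighbour models."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    classes = np.unique(y)
    q0 = q0 or max(1, int(round(np.sqrt(d))))
    q = q or max(1, q0 // 2)
    n_B = int(pi_B * n)
    votes = np.zeros((X_test.shape[0], classes.size))
    for _ in range(B):
        # Step 1: subsample without replacement.
        idx = rng.choice(n, size=n_B, replace=False)
        Xb, yb = X[idx], y[idx]
        # Step 2: random covariate subset Q_b.
        Qb = rng.choice(d, size=q0, replace=False)
        Z = Xb[:, Qb]
        # Step 3: k-th in-class and out-of-class neighbour of each point.
        D2 = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
        np.fill_diagonal(D2, np.inf)
        W = np.zeros((q0, q0))   # scatter of in-class k-th-neighbour gaps
        Bm = np.zeros((q0, q0))  # scatter of out-of-class k-th-neighbour gaps
        for i in range(n_B):
            same = yb == yb[i]
            same[i] = False
            if not same.any():
                continue  # assumption: skip points whose class has no other member
            j_in = np.where(same)[0][np.argsort(D2[i, same])[min(k, same.sum()) - 1]]
            j_out = np.where(~same)[0][np.argsort(D2[i, ~same])[min(k, (~same).sum()) - 1]]
            g_in, g_out = Z[i] - Z[j_in], Z[i] - Z[j_out]
            W += np.outer(g_in, g_in)
            Bm += np.outer(g_out, g_out)
        # Step 4: discriminant matrix and its leading q eigenvectors
        # (small ridge added so the within-scatter term is invertible).
        M = np.linalg.solve(W + 1e-8 * np.eye(q0), Bm)
        eigvals, U = np.linalg.eig(M)
        P = U[:, np.argsort(-eigvals.real)[:q]].real
        # Step 5: kNN class-proportion vote in the projected subspace.
        Zb, Zt = Z @ P, X_test[:, Qb] @ P
        for t in range(Zt.shape[0]):
            nn = np.argsort(((Zb - Zt[t]) ** 2).sum(-1))[:k]
            votes[t] += (yb[nn, None] == classes[None, :]).mean(0)
    # Final prediction: class with the largest average vote across the bag.
    return classes[votes.argmax(1)]
```

A dense pairwise-distance matrix is used here for clarity; the paper's own implementation notes concern efficiency, so this brute-force search should be read only as a specification of the steps.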
Open Source Code | Yes | Code to implement the proposed approach is available in the form of an R package from https://github.com/DavidHofmeyr/BOPNN.
Open Datasets | Yes | For our experiments we considered all 162 classification data sets in the Penn Machine Learning Benchmarks database (Olson et al., 2017). We repeatedly sampled training and test sets from each data set, with training sets constituting 70% and test sets the remaining 30%. The one exception is that, due to the very large amount of compute time required for all experiments, training sets were capped at 7000 points and test sets at 3000 points (however, a different total of 10 000 points was used in each training/test combination, where relevant).
Dataset Splits | Yes | We repeatedly sampled training and test sets from each data set, with training sets constituting 70% and test sets the remaining 30%. The number of times training/test splits were sampled varied with the overall sample size, n, as follows: 0 < n < 500: 50 times; 500 ≤ n < 1000: 20 times; 1000 ≤ n < 5000: 10 times; and 5000 ≤ n: 5 times. The one exception is that, due to the very large amount of compute time required for all experiments, training sets were capped at 7000 points and test sets at 3000 points (however, a different total of 10 000 points was used in each training/test combination, where relevant).
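The replication schedule and split caps quoted above amount to two small rules, sketched below in Python. The function names `n_repeats` and `split_sizes` are illustrative, not from the paper.

```python
def n_repeats(n):
    """Number of train/test resamples as a function of data set size n.

    Encodes the schedule quoted above: smaller data sets are resampled
    more often to stabilise the accuracy estimates.
    """
    if n < 500:
        return 50
    if n < 1000:
        return 20
    if n < 5000:
        return 10
    return 5

def split_sizes(n):
    """70/30 split, with training capped at 7000 and test at 3000 points."""
    n_train = min(int(0.7 * n), 7000)
    n_test = min(n - int(0.7 * n), 3000)
    return n_train, n_test
```

For example, a data set with n = 1000 would be split 10 times into 700 training and 300 test points, while a data set with n = 200 000 would be split 5 times with both caps binding.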
Hardware Specification | No | The paper discusses running times of algorithms and states: "our implementation of BOPNN has not been optimized for speed, and runs in serial on a single thread. On the other hand the RF implementation in the package randomForestSRC is highly optimized and parallelized." However, it does not specify any particular hardware models (e.g., CPU, GPU, memory) used for these experiments.
Software Dependencies | Yes | RF: The random forest classifier, as implemented in the R (R Core Team, 2024) package randomForestSRC (Ishwaran & Kogalur, 2019), available on CRAN... SVM: The Support Vector Machine classifier... We used the liquidSVM implementation (Steinwart & Thomann, 2017).
Experiment Setup | Yes | 1. BOPNN: The proposed approach (Bag Of Projected Nearest Neighbours), where each ensemble comprised a bag of 100 kNN models (i.e., B = 100). For each data set and training sample, thirty values for k (the number of neighbours); q0 (the size of the random subset of covariates sampled for each model); and q (the number of eigenvectors of each Σ̂_b retained) were sampled uniformly as k ~ U({1, ..., 5}); q0 ~ U({⌈p^{1/2}⌉, ..., min{⌈10p^{1/2}⌉, p}}); and q | q0 ~ U({⌈0.5q0⌉, ..., q0 − 1}) respectively. Models were fit for each setting of these hyperparameters and with the size of each bootstrap sample being 0.63 times the size of the training set (π_B = 0.63). The model with the highest Out-Of-Bag estimate for classification accuracy was then applied on the test set. 7. RF: The random forest classifier... hyperparameter selection was conducted from 30 random selections based on OOB performance. The parameters which were tuned are mtry (the number of randomly selected covariates considered as candidates for each split in a tree), which was selected from the interval [0.1p^{1/2}, min{p, 10p^{1/2}}]; and the minimum size of a leaf node in a tree, from {1, ..., 10}. 9. RDA: Regularised Discriminant Analysis (Friedman, 1989)... where λ and γ are hyperparameters which must be chosen, and for which we used 5-fold cross-validation.
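The BOPNN hyperparameter sampling scheme described above can be sketched as follows. Note the ceiling brackets and the q0 − 1 upper bound for q are reconstructions from the quoted (extraction-damaged) text, and `sample_bopnn_hyperparameters` is a hypothetical helper name, not part of the BOPNN package.

```python
import math
import random

def sample_bopnn_hyperparameters(p, rng=random):
    """Draw one (k, q0, q) setting for a data set with p covariates.

    Hypothetical sketch of the sampling scheme quoted above; bracket
    placement (ceilings) is an assumption made while reconstructing
    the garbled formulas.
    """
    k = rng.randint(1, 5)                      # k ~ U({1, ..., 5}), inclusive bounds
    lo = math.ceil(math.sqrt(p))               # ceil(p^(1/2))
    hi = min(math.ceil(10 * math.sqrt(p)), p)  # min{ceil(10 p^(1/2)), p}
    q0 = rng.randint(lo, hi)
    # q | q0 ~ U({ceil(0.5 q0), ..., q0 - 1}); guard the degenerate q0 = 1 case.
    q_lo = math.ceil(0.5 * q0)
    q = rng.randint(q_lo, max(q_lo, q0 - 1))
    return k, q0, q
```

In the experiment described above, thirty such draws would be scored by out-of-bag accuracy and the best setting applied to the test set.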