Explaining the Success of AdaBoost and Random Forests as Interpolating Classifiers

Authors: Abraham J. Wyner, Matthew Olson, Justin Bleich, David Mease

JMLR 2017

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We provide a number of examples to support this explanation. In the process, we question the conventional wisdom that boosting algorithms for classification require regularization or early stopping and should be limited to low-complexity classes of learners, such as decision stumps. We conclude that boosting should be used like random forests: with large decision trees, without regularization or early stopping. We will begin with an easy-to-visualize example that demonstrates how fine interpolation can provide robustness in a noisy setting. In particular, we compare the performance of AdaBoost, random forests, and one-nearest neighbors, which are all interpolating classifiers. We will show that the self-averaging property of AdaBoost and random forests is crucial. This property will be discussed in subsequent sections. The implementation of AdaBoost is carried out according to the algorithm described earlier. The base learners used are trees fit by the rpart package (Therneau and Atkinson, 1997) in R. The trees are grown to a maximum depth of 8, meaning they may have at most 2^8 = 256 terminal nodes. This will be the implementation of AdaBoost we will consider throughout the remainder of this paper. We will consider again the pure-noise model as described in the previous section, where the probability that y is equal to +1 for every x is some constant value p > .5. For the training data we will take n = 400 points uniformly sampled on [0, 1]^2 according to a Latin hypercube using the midpoints as before. For the corresponding y values in the training data we will randomly choose 80 points to be -1s so that P(y = 1|x) = .8. Figure 5 displays the results for the following: (a) one-NN, (b) AdaBoost, and (c) random forests. We now repeat the simulation in Section 3.3 with a larger sample size and in 20 dimensions instead of 2.
Specifically, the training data now has n = 5000 observations sampled according to the midpoints of a Latin hypercube design uniformly on [0, 1]^20. We again randomly select 20%, or 1000, of these points to be -1s, with the remaining 4000 being +1s. Since in 20 dimensions it is difficult to display the resulting classification rules graphically, we instead examine the rules on a hold-out sample of 10,000 points sampled uniformly and independently on [0, 1]^20. Figure 8 plots the proportion of points in the hold-out sample classified by AdaBoost as +1 as a function of the number of iterations. We repeated a version of the analysis conducted in Section 5.1 for five additional data sets from the UCI repository: Haberman, Wisconsin breast cancer, voting, Pima, and German credit. In our analysis we did the following: we added 5% label noise to the training data, imputed missing values with the mean prior to model fitting, and conducted all experiments on 50 random training-testing splits of each data set.
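The simulation design quoted above (midpoint Latin hypercube inputs, a fixed fraction of randomly chosen -1 labels, and an independent uniform hold-out sample) is straightforward to reproduce. Below is a minimal NumPy sketch, not the authors' R code; the function name and seed are ours:

```python
import numpy as np

def midpoint_lhs(n, d, rng):
    """Midpoint Latin hypercube on [0, 1]^d: each axis is cut into n equal
    cells and the cell midpoints are matched up by random permutations."""
    mids = (np.arange(n) + 0.5) / n
    return np.column_stack([rng.permutation(mids) for _ in range(d)])

rng = np.random.default_rng(0)
X = midpoint_lhs(5000, 20, rng)            # training inputs, n = 5000 in 20-d
y = np.ones(5000, dtype=int)               # pure-noise labels: P(y = +1|x) = .8
y[rng.choice(5000, size=1000, replace=False)] = -1   # 20% chosen to be -1s
X_hold = rng.random((10000, 20))           # hold-out sample, uniform on [0, 1]^20
```

By construction, each coordinate of `X` hits every one of the 5000 axis cells exactly once, which is what makes the design space-filling even in 20 dimensions.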
Researcher Affiliation | Collaboration | Abraham J. Wyner (EMAIL), Matthew Olson (EMAIL), Justin Bleich (EMAIL): Department of Statistics, Wharton School, University of Pennsylvania, Philadelphia, PA 19104, USA. David Mease (EMAIL): Apple Inc.
Pseudocode | Yes | Algorithm 1: AdaBoost (Hastie et al., 2009)
1. Initialize the observation weights w_i = 1/N, i = 1, 2, ..., N.
2. For m = 1 to M:
   (a) Fit a classifier G_m(x) to the training data using weights w_i.
   (b) Compute err_m = [sum_{i=1}^{N} w_i I(y_i != G_m(x_i))] / [sum_{i=1}^{N} w_i].
   (c) Compute alpha_m = log((1 - err_m) / err_m).
   (d) Set w_i <- w_i exp(alpha_m I(y_i != G_m(x_i))).
   (e) Set f_m(x) = sum_{t=1}^{m} alpha_t G_t(x).
3. Output f(x) = sign(f_M(x)).
...
Algorithm 2: Random Forests (Hastie et al., 2009)
1. For b = 1 to B:
   (a) Draw a bootstrap sample X* of size N from the training data.
   (b) Grow a decision tree T_b on X* by repeating the following recursively until the minimum node size n_min is reached:
      i. Select m of the p variables.
      ii. Pick the best variable/split-point among the m variables and partition.
2. Output the ensemble {T_b}_{b=1}^{B}.
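Algorithm 1 translates almost line for line into code. The sketch below uses weighted decision stumps as the base learner purely to stay self-contained (the paper's implementation instead grows rpart trees to depth 8); all function names are ours, and the early-exit handling of degenerate base learners is an assumption of this sketch:

```python
import numpy as np

def fit_stump(X, y, w):
    """Weighted base learner for step 2(a): best (feature, threshold, sign)
    stump under observation weights w. Labels y are coded +/-1."""
    best = (np.inf, 0, 0.0, 1)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            for s in (1, -1):
                pred = np.where(X[:, j] <= t, s, -s)
                err = w[pred != y].sum()
                if err < best[0]:
                    best = (err, j, t, s)
    return best

def stump_predict(X, j, t, s):
    return np.where(X[:, j] <= t, s, -s)

def adaboost_fit(X, y, M=50):
    """Discrete AdaBoost, steps 1-2 of Algorithm 1."""
    N = len(y)
    w = np.full(N, 1.0 / N)                      # step 1: uniform weights
    model = []
    for _ in range(M):                           # step 2
        err, j, t, s = fit_stump(X, y, w)        # (a) fit G_m with weights w
        err /= w.sum()                           # (b) weighted error err_m
        if err <= 0:                             # perfect fit: keep it and stop
            model.append((1.0, j, t, s))
            break
        if err >= 0.5:                           # no better than chance: stop
            break
        alpha = np.log((1 - err) / err)          # (c) alpha_m
        miss = stump_predict(X, j, t, s) != y
        w = w * np.exp(alpha * miss)             # (d) upweight mistakes
        model.append((alpha, j, t, s))
    return model

def adaboost_predict(X, model):
    """Step 3: sign of the weighted vote f_M(x)."""
    f = sum(a * stump_predict(X, j, t, s) for a, j, t, s in model)
    return np.sign(f)
```

Note that step 2(d) only reweights the misclassified points, so the weights never need explicit renormalization as long as err_m is computed relative to the current total weight, as in step 2(b).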
Open Source Code | No | The paper mentions using the 'rpart package (Therneau and Atkinson, 1997) in R' and the 'R implementation randomForest (Liaw and Wiener, 2002)'. These are third-party tools that were utilized. The authors do not provide a specific link or explicit statement about releasing their own source code for the methodology described in this paper.
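Although the authors relied on the R randomForest package, Algorithm 2 itself is compact enough to sketch directly. The following is a minimal, self-contained NumPy version (not the authors' code); the Gini criterion, the majority-vote leaves, and all names are assumptions of this sketch:

```python
import numpy as np

def gini(y):
    """Gini impurity for labels coded +/-1."""
    p = (y == 1).mean()
    return 2 * p * (1 - p)

def grow_tree(X, y, m, n_min, rng):
    """Step 1(b) of Algorithm 2: at each node, try m randomly chosen
    features and split on the cut minimizing weighted child impurity."""
    if len(y) <= n_min or len(set(y)) == 1:
        return ('leaf', 1 if y.sum() >= 0 else -1)      # majority-vote leaf
    best = None
    for j in rng.choice(X.shape[1], size=m, replace=False):
        for t in np.unique(X[:, j])[:-1]:               # candidate cut points
            left = X[:, j] <= t
            imp = sum(s.sum() * gini(y[s]) for s in (left, ~left))
            if best is None or imp < best[0]:
                best = (imp, j, t, left)
    if best is None:                                     # no valid split
        return ('leaf', 1 if y.sum() >= 0 else -1)
    _, j, t, left = best
    return ('split', j, t,
            grow_tree(X[left], y[left], m, n_min, rng),
            grow_tree(X[~left], y[~left], m, n_min, rng))

def tree_predict(node, x):
    while node[0] == 'split':
        _, j, t, l, r = node
        node = l if x[j] <= t else r
    return node[1]

def forest_predict(X_tr, y_tr, X_te, B=25, m=None, n_min=1, seed=0):
    """Algorithm 2: B bootstrap trees, then a majority vote (step 2)."""
    rng = np.random.default_rng(seed)
    m = m or max(1, int(np.sqrt(X_tr.shape[1])))         # common default for m
    trees = []
    for _ in range(B):
        idx = rng.integers(0, len(y_tr), len(y_tr))      # step 1(a): bootstrap
        trees.append(grow_tree(X_tr[idx], y_tr[idx], m, n_min, rng))
    votes = np.array([[tree_predict(T, x) for x in X_te] for T in trees])
    return np.sign(votes.sum(axis=0))
```

Setting n_min = 1 grows each tree until its leaves are pure, i.e., until each tree interpolates its bootstrap sample, which is the regime the paper argues for.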
Open Datasets | Yes | The original data can be found at the following address: https://www.elen.ucl.ac.be/neuralnets/Research/Projects/ELENA/databases/REAL/phoneme/phoneme.txt. We repeated a version of the analysis conducted in Section 5.1 for five additional data sets from the UCI repository: Haberman, Wisconsin breast cancer, voting, Pima, and German credit.
Dataset Splits | Yes | We begin by dividing the phoneme data set into a training set consisting of 70% of the samples and a testing set consisting of the remaining 30%. Specifically, the training data now has n = 5000 observations sampled according to the midpoints of a Latin hypercube design uniformly on [0, 1]^20. We again randomly select 20%, or 1000, of these points to be -1s, with the remaining 4000 being +1s. Since in 20 dimensions it is difficult to display the resulting classification rules graphically, we instead examine the rules on a hold-out sample of 10,000 points sampled uniformly and independently on [0, 1]^20.
Hardware Specification | No | The paper describes various simulations and analyses, including the use of R packages and decision trees, but does not provide any specific details about the hardware (e.g., CPU, GPU models, memory, or cloud resources) used to run these experiments.
Software Dependencies | No | The paper mentions the use of the 'rpart package (Therneau and Atkinson, 1997) in R' and the 'R implementation randomForest (Liaw and Wiener, 2002)'. While it names the software and programming language, it does not provide specific version numbers for R, the rpart package, or the randomForest package, which are necessary for a reproducible description of ancillary software.
Experiment Setup | Yes | The base learners used are trees fit by the rpart package (Therneau and Atkinson, 1997) in R. The trees are grown to a maximum depth of 8, meaning they may have at most 2^8 = 256 terminal nodes. This will be the implementation of AdaBoost we will consider throughout the remainder of this paper. We run AdaBoost for 500 iterations, fit a random forests model with 500 trees, and build a CART tree that is pruned using cross-validation. We repeated a version of the analysis conducted in Section 5.1 for five additional data sets from the UCI repository: Haberman, Wisconsin breast cancer, voting, Pima, and German credit. In our analysis we added 5% label noise to the training data, imputed missing values with the mean prior to model fitting, and conducted all experiments on 50 random training-testing splits of each data set.
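The evaluation protocol described in this row (mean imputation, 5% label noise on the training side only, repeated random splits) can be sketched generically. This is our own scaffolding, not the authors' scripts: the function names, the seed, and the 70/30 split proportion (borrowed from the phoneme experiment; the paper does not state the split proportion for the UCI repeats) are assumptions:

```python
import numpy as np

def run_experiment(X, y, fit_predict, n_reps=50, noise=0.05,
                   test_frac=0.3, seed=0):
    """Average test error of `fit_predict(X_train, y_train, X_test)` over
    n_reps random train/test splits. Labels y are coded +/-1. Label noise
    is injected into the training labels only; test labels stay clean."""
    rng = np.random.default_rng(seed)
    X = np.where(np.isnan(X), np.nanmean(X, axis=0), X)   # mean imputation
    errors = []
    for _ in range(n_reps):
        idx = rng.permutation(len(y))
        n_test = int(test_frac * len(y))
        test, train = idx[:n_test], idx[n_test:]
        y_tr = y[train].copy()
        flip = rng.choice(len(train), size=int(noise * len(train)),
                          replace=False)
        y_tr[flip] *= -1                                   # 5% label noise
        y_hat = fit_predict(X[train], y_tr, X[test])
        errors.append(np.mean(y_hat != y[test]))
    return float(np.mean(errors))
```

Any of the classifiers discussed in the paper can be dropped in as `fit_predict`; keeping the noise injection inside the loop ensures each of the 50 replicates flips an independent random 5% of its training labels.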