Explaining the Success of AdaBoost and Random Forests as Interpolating Classifiers
Authors: Abraham J. Wyner, Matthew Olson, Justin Bleich, David Mease
JMLR 2017
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We provide a number of examples to support this explanation. In the process, we question the conventional wisdom that boosting algorithms for classification require regularization or early stopping and should be limited to low-complexity classes of learners, such as decision stumps. We conclude that boosting should be used like random forests: with large decision trees, without regularization or early stopping. We begin with an easy-to-visualize example that demonstrates how fine interpolation can provide robustness in a noisy setting. In particular, we compare the performance of AdaBoost, random forests, and one-nearest neighbors, which are all interpolating classifiers. We will show that the self-averaging property of AdaBoost and random forests is crucial; this property is discussed in subsequent sections. The implementation of AdaBoost follows the algorithm described earlier. The base learners are trees fit by the rpart package (Therneau and Atkinson, 1997) in R. The trees are grown to a maximum depth of 8, meaning they may have at most 2^8 = 256 terminal nodes. This is the implementation of AdaBoost considered throughout the remainder of the paper. We consider again the pure-noise model described in the previous section, where the probability that y equals +1 for every x is some constant p > .5. For the training data we take n = 400 points uniformly sampled on [0, 1]^2 according to a Latin hypercube using the midpoints as before. For the corresponding y values in the training data we randomly choose 80 points to be -1's so that P(y = 1\|x) = .8. Figure 5 displays the results for: (a) one-NN, (b) AdaBoost, and (c) random forests. We now repeat the simulation in Section 3.3 with a larger sample size and in 20 dimensions instead of 2. Specifically, the training data now has n = 5000 observations sampled according to the midpoints of a Latin hypercube design uniformly on [0, 1]^20. We again randomly select 20%, or 1000, of these points to be -1's, with the remaining 4000 being +1's. Since in 20 dimensions it is difficult to display the resulting classification rules graphically, we instead examine the rules on a hold-out sample of 10,000 points sampled uniformly and independently on [0, 1]^20. Figure 8 plots the proportion of points in the hold-out sample classified by AdaBoost as +1 as a function of the number of iterations. We repeated a version of the analysis conducted in Section 5.1 for five additional data sets from the UCI repository: Haberman, Wisconsin breast cancer, voting, Pima, and German credit. In our analysis we added 5% label noise to the training data, imputed missing values with the mean prior to model fitting, and conducted all experiments on 50 random training-testing splits of each data set. |
| Researcher Affiliation | Collaboration | Abraham J. Wyner, Matthew Olson, Justin Bleich — Department of Statistics, Wharton School, University of Pennsylvania, Philadelphia, PA 19104, USA; David Mease — Apple Inc. |
| Pseudocode | Yes | Algorithm 1: AdaBoost (Hastie et al., 2009). 1. Initialize the observation weights w_i = 1/N, i = 1, 2, ..., N. 2. For m = 1 to M: (a) Fit a classifier G_m(x) to the training data using weights w_i. (b) Compute err_m = [Σ_{i=1}^N w_i I(y_i ≠ G_m(x_i))] / [Σ_{i=1}^N w_i]. (c) Compute α_m = log((1 − err_m)/err_m). (d) Set w_i ← w_i exp(α_m I(y_i ≠ G_m(x_i))). (e) Set f_M(x) = Σ_{m=1}^M α_m G_m(x). 3. Output f(x) = sign(f_M(x)). ... Algorithm 2: Random Forests (Hastie et al., 2009). 1. For b = 1 to B: (a) Draw a bootstrap sample X of size N from the training data. (b) Grow a decision tree T_b on X by repeating the following steps recursively until the minimum node size n_min is reached: i. Select m of the p variables. ii. Pick the best variable/split-point among the m variables and partition. 2. Output the ensemble {T_b}_{b=1}^B. |
| Open Source Code | No | The paper mentions using the 'rpart package (Therneau and Atkinson, 1997) in R' and the 'R implementation randomForest (Liaw and Wiener, 2002)'. These are third-party tools that were utilized. The authors do not provide a specific link or explicit statement about releasing their own source code for the methodology described in this paper. |
| Open Datasets | Yes | The original data can be found at the following address: https://www.elen.ucl.ac.be/neuralnets/Research/Projects/ELENA/databases/REAL/phoneme/phoneme.txt. We repeated a version of the analysis conducted in Section 5.1 for five additional data sets from the UCI repository: Haberman, Wisconsin breast cancer, voting, Pima, and German credit. |
| Dataset Splits | Yes | We begin by dividing the phoneme data set into a training set consisting of 70% of the samples and a testing set consisting of 30% of the examples. Specifically, the training data now has n = 5000 observations sampled according to the midpoints of a Latin hypercube design uniformly on [0, 1]^20. We again randomly select 20%, or 1000, of these points to be -1's, with the remaining 4000 being +1's. Since in 20 dimensions it is difficult to display the resulting classification rules graphically, we instead examine the rules on a hold-out sample of 10,000 points sampled uniformly and independently on [0, 1]^20. |
| Hardware Specification | No | The paper describes various simulations and analyses, including the use of R packages and decision trees, but does not provide any specific details about the hardware (e.g., CPU, GPU models, memory, or cloud resources) used to run these experiments. |
| Software Dependencies | No | The paper mentions the use of the 'rpart package (Therneau and Atkinson, 1997) in R' and the 'R implementation randomForest (Liaw and Wiener, 2002)'. While it names the software and programming language, it does not provide specific version numbers for R, the rpart package, or the randomForest package, which are necessary for a reproducible description of ancillary software. |
| Experiment Setup | Yes | The base learners used are trees fit by the rpart package (Therneau and Atkinson, 1997) in R. The trees are grown to a maximum depth of 8, meaning they may have at most 2^8 = 256 terminal nodes. This is the implementation of AdaBoost considered throughout the remainder of the paper. We run AdaBoost for 500 iterations, fit a random forests model with 500 trees, and build a CART tree that is pruned using cross-validation. We repeated a version of the analysis conducted in Section 5.1 for five additional data sets from the UCI repository: Haberman, Wisconsin breast cancer, voting, Pima, and German credit. In our analysis we added 5% label noise to the training data, imputed missing values with the mean prior to model fitting, and conducted all experiments on 50 random training-testing splits of each data set. |
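Algorithm 1 above can be sketched as a minimal discrete AdaBoost in pure Python. The threshold-stump base learner and the toy 1-D data below are illustrative stand-ins (the paper's experiments use depth-8 rpart trees in R); this is not the authors' implementation.

```python
import math

def fit_stump(X, y, w):
    # Base learner G_m: best weighted threshold stump on 1-D inputs.
    best = None
    for t in sorted(set(X)):
        for sign in (1, -1):
            err = sum(wi for xi, yi, wi in zip(X, y, w)
                      if (sign if xi <= t else -sign) != yi)
            if best is None or err < best[0]:
                best = (err, t, sign)
    err, t, sign = best
    return (lambda x, t=t, s=sign: s if x <= t else -s), err

def adaboost(X, y, M=10):
    N = len(X)
    w = [1.0 / N] * N                          # 1. uniform weights w_i = 1/N
    learners = []
    for _ in range(M):                         # 2. for m = 1 to M
        G, err = fit_stump(X, y, w)            # (a) fit weighted base learner
        err = max(err / sum(w), 1e-10)         # (b) weighted error rate err_m
        if err >= 0.5:
            break
        alpha = math.log((1 - err) / err)      # (c) alpha_m = log((1-err)/err)
        w = [wi * math.exp(alpha * (G(xi) != yi))   # (d) upweight mistakes
             for wi, xi, yi in zip(w, X, y)]
        learners.append((alpha, G))
    # 3. output sign of the weighted vote f_M(x)
    return lambda x: 1 if sum(a * G(x) for a, G in learners) >= 0 else -1

# Toy 1-D data: +1 below 0.5, -1 above.
X = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
y = [1, 1, 1, 1, -1, -1, -1, -1]
clf = adaboost(X, y, M=5)
print([clf(x) for x in X])  # reproduces y on this separable toy set
```

With depth-8 trees instead of stumps, the same loop yields the interpolating classifier the paper studies.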
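Algorithm 2 can likewise be sketched in pure Python: a bootstrap sample per tree, recursive growth until the minimum node size n_min, and a random subset of m of the p variables considered at each split. This is a hypothetical toy (Gini splits, majority-vote leaves), not the randomForest package.

```python
import random

def grow_tree(X, y, m, nmin, rng):
    # Grow an unpruned tree recursively until the minimum node size (step 1b).
    if len(y) <= nmin or len(set(y)) == 1:
        return ("leaf", 1 if sum(y) >= 0 else -1)

    def gini(idx):
        if not idx:
            return 0.0
        pos = sum(1 for i in idx if y[i] == 1) / len(idx)
        return 2.0 * pos * (1.0 - pos)

    best = None
    for j in rng.sample(range(len(X[0])), m):            # i. m of the p variables
        for t in sorted(set(row[j] for row in X))[:-1]:  # candidate split points
            left = [i for i in range(len(y)) if X[i][j] <= t]
            right = [i for i in range(len(y)) if X[i][j] > t]
            score = len(left) * gini(left) + len(right) * gini(right)
            if best is None or score < best[0]:
                best = (score, j, t, left, right)        # ii. best variable/split
    if best is None:                                     # no usable split left
        return ("leaf", 1 if sum(y) >= 0 else -1)
    _, j, t, left, right = best
    return ("node", j, t,
            grow_tree([X[i] for i in left], [y[i] for i in left], m, nmin, rng),
            grow_tree([X[i] for i in right], [y[i] for i in right], m, nmin, rng))

def predict_tree(tree, x):
    while tree[0] == "node":
        _, j, t, left, right = tree
        tree = left if x[j] <= t else right
    return tree[1]

def random_forest(X, y, B=25, m=1, nmin=1, seed=0):
    trees, rng = [], random.Random(seed)
    for _ in range(B):
        idx = [rng.randrange(len(y)) for _ in range(len(y))]  # 1a. bootstrap sample
        trees.append(grow_tree([X[i] for i in idx],
                               [y[i] for i in idx], m, nmin, rng))
    # 2. majority vote over the ensemble {T_b}
    return lambda x: 1 if sum(predict_tree(t, x) for t in trees) >= 0 else -1

# Toy 2-D data separable on the first coordinate.
X = [(0.1, 0.9), (0.2, 0.1), (0.3, 0.5), (0.4, 0.7),
     (0.6, 0.2), (0.7, 0.8), (0.8, 0.4), (0.9, 0.6)]
y = [1, 1, 1, 1, -1, -1, -1, -1]
forest = random_forest(X, y, B=25)
```

Because nmin = 1, each tree interpolates its bootstrap sample; the vote over B trees supplies the self-averaging the paper emphasizes.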
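The pure-noise training design (n = 400 midpoint Latin-hypercube points on [0, 1]^2, with a random 20% of labels set to -1 so that P(y = 1|x) = .8) can be reproduced schematically. The generator below is an assumed sketch of that design, not the authors' code.

```python
import random

def midpoint_lhs(n, d, rng):
    # Midpoint Latin hypercube on [0, 1]^d: each axis is split into n equal
    # cells and each cell midpoint appears exactly once per dimension,
    # in independently shuffled order.
    cols = []
    for _ in range(d):
        mids = [(i + 0.5) / n for i in range(n)]
        rng.shuffle(mids)
        cols.append(mids)
    return list(zip(*cols))

rng = random.Random(0)
n = 400
X = midpoint_lhs(n, 2, rng)

# Pure-noise labels: P(y = +1 | x) = 0.8, so a random 20% of points are -1.
y = [1] * n
for i in rng.sample(range(n), k=n // 5):
    y[i] = -1

print(sum(1 for v in y if v == -1))  # 80 noise points out of 400
```

The 20-dimensional version in Section 3.3 is the same construction with n = 5000 and d = 20.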