Quantifying Uncertainty in Random Forests via Confidence Intervals and Hypothesis Tests

Authors: Lucas Mentch, Giles Hooker

JMLR 2016

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Simulations and illustrations on a real data set are provided.
Researcher Affiliation | Academia | Lucas Mentch (EMAIL), Giles Hooker (EMAIL), Department of Statistical Science, Cornell University, Ithaca, NY 14850, USA
Pseudocode | Yes | Algorithm 1: Subbagging; Algorithm 2: Subsampled Random Forest; Algorithm 3: ζ_{1,k_n} Estimation Procedure; Algorithm 4: ζ_{k_n,k_n} Estimation Procedure; Algorithm 5: Internal Variance Estimation Method; Algorithm 6: Σ_{1,k_n} Estimation Procedure
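Algorithms 1 and 2 describe subsampled ensembles: each base tree is fit on a random subsample of size k drawn without replacement, and the trees' predictions are averaged. A minimal sketch of subbagging, with the assumption of scikit-learn's DecisionTreeRegressor standing in for the R trees the paper actually uses:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def subbag_predict(X_train, y_train, x_test, k, m, rng=None):
    """Subbagging sketch: average m trees, each fit on a
    size-k subsample drawn without replacement."""
    rng = np.random.default_rng(rng)
    n = len(X_train)
    preds = []
    for _ in range(m):
        idx = rng.choice(n, size=k, replace=False)  # subsample, no replacement
        tree = DecisionTreeRegressor(min_samples_split=3)
        tree.fit(X_train[idx], y_train[idx])
        preds.append(tree.predict(x_test))
    return np.mean(preds, axis=0)  # ensemble average

# toy usage on synthetic data
rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 2))
y = X[:, 0] + 0.1 * rng.normal(size=200)
pred = subbag_predict(X, y, np.array([[0.5, 0.5]]), k=30, m=50, rng=1)
print(pred)
```

Because the trees are averaged over i.i.d. subsamples, the ensemble prediction is a (complete) U-statistic in expectation, which is what drives the asymptotic normality results of the paper.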
Open Source Code | No | The paper does not contain any explicit statements about releasing source code or provide links to a code repository.
Open Datasets | Yes | For our analysis, we restrict our attention to observations (and non-observations) of the Indigo Bunting species. For the first part of our analysis, we further restrict our attention to observations made during the year 2010. A little more than 400,000 reports of either presence or absence of Indigo Buntings were recorded during 2010 and the data set consists of 23 features. Like many species, the abundance of Indigo Buntings is known to fluctuate throughout the year, so we have two primary goals: (1) to produce confidence intervals for monthly abundance and (2) to show that the feature month is significant for predicting abundance. A presence/absence plot of Indigo Buntings by month is shown in Figure 7. A few features of this plot are worth pointing out. Most obviously, there are many more absence observations each month than presence observations. This makes sense because each time a birder submits a report, they note when Indigo Buntings are not present. Next, we see that this species is only observed during the warmer months, so month seems highly significant for predicting abundance. Finally, we see that all months have a large number of reports, so we need not worry about underreporting issues throughout the year. The data is part of the ongoing eBird citizen science project described in Sullivan et al. (2009). This project is hosted by Cornell's Lab of Ornithology and relies on citizens, referred to as birders, to submit reports of bird observations. Location, bird species observed and not observed, effort level, and number of birds of each species observed are just a few of the variables participants are asked to provide. In addition to the data contained in these reports, landcover characteristics as reported in the 2006 United States National Land Cover Database are also available so that information about the local terrain may be used to help predict species abundance.
Dataset Splits | Yes | We ran 250 simulations with n = 1000, m = 1000, and k = 75 using a test set consisting of all 41 test points, the 20 central-most points, and the 5 central-most points. For this test, we randomly selected 20 points from the training set as the test set and calculated the test statistic based on an internal variance estimate with n_z = 250 and n_MC = 5000.
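The n_z and n_MC parameters control the Monte Carlo variance estimate: for each of n_z initial points z, an inner set of n_MC trees is built on subsamples all forced to contain z, and the variance of the resulting conditional mean predictions estimates the ζ_{1,k_n} component. A rough sketch under that reading, with the assumption of scikit-learn trees in place of the paper's R trees and a single test point:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def estimate_zeta_1k(X, y, x_test, k, n_z, n_mc, rng=None):
    """Sketch of a Monte Carlo zeta_{1,k} estimate: variance over n_z
    shared points z of the mean prediction of n_mc trees whose
    subsamples each contain z."""
    rng = np.random.default_rng(rng)
    n = len(X)
    cond_means = []
    for _ in range(n_z):
        z = rng.integers(n)  # the shared initial point
        preds = []
        for _ in range(n_mc):
            others = np.delete(np.arange(n), z)
            rest = rng.choice(others, size=k - 1, replace=False)
            idx = np.append(rest, z)  # subsample forced to contain z
            tree = DecisionTreeRegressor(min_samples_leaf=2).fit(X[idx], y[idx])
            preds.append(tree.predict(x_test)[0])
        cond_means.append(np.mean(preds))  # conditional mean given z
    return np.var(cond_means)

# small toy run (far smaller n_z, n_mc than the paper's 250 and 5000)
rng = np.random.default_rng(0)
X = rng.uniform(size=(300, 2))
y = X[:, 0] + 0.1 * rng.normal(size=300)
z1k = estimate_zeta_1k(X, y, np.array([[0.5, 0.5]]), k=30, n_z=20, n_mc=50, rng=1)
print(z1k)
```

The paper combines this component with an estimate of ζ_{k_n,k_n} to form the plug-in limiting variance, roughly (k_n²/n)·ζ_{1,k_n} + (1/m)·ζ_{k_n,k_n}, from which confidence intervals and test statistics follow by asymptotic normality.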
Hardware Specification | No | The paper does not contain any specific details about the hardware used for experiments, such as GPU/CPU models or memory specifications.
Software Dependencies | No | Each tree in the ensembles was built using the rpart function in R, with the additional restriction that at least 3 observations per node were needed in order for the algorithm to consider splitting on that node. These trees were grown using the randomForest function in R with the ntree argument set to 1. At each node in each tree, 3 of the 5 features X1, ..., X5 were selected at random as candidates for splits and we insisted on at least 2 observations in each terminal node.
Experiment Setup | Yes | Each tree in the ensembles was built using the rpart function in R, with the additional restriction that at least 3 observations per node were needed in order for the algorithm to consider splitting on that node. We would also like to acknowledge similar work currently in progress by Wager (2014). Wager builds upon the potential nearest neighbor framework introduced by Lin and Jeon (2006) and seeks to provide a limiting distribution for the case where many trees are used in the ensemble, roughly corresponding to our result (i) in Theorems 1 and 2. The author considers only an idealized class of trees based on the assumptions in Meinshausen (2006) as well as additional honesty and regularity conditions that allow k_n to grow at a faster rate, and demonstrates that when many Monte Carlo samples are employed, the infinitesimal jackknife estimator of variance is consistent and predictions are asymptotically normal. This estimator has roughly the same computational complexity as those we propose in Section 3 and should scale well subject to some additional bookkeeping. In contrast, the theory we provide here takes into account all possible rates of Monte Carlo sampling via the three cases discussed in Theorems 1 and 2 and we provide a consistent means for estimating each corresponding variance. These trees were grown using the randomForest function in R with the ntree argument set to 1. At each node in each tree, 3 of the 5 features X1, ..., X5 were selected at random as candidates for splits and we insisted on at least 2 observations in each terminal node.
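The stated tree constraints (at least 3 observations before a node may split, 3 of 5 features considered per split, at least 2 observations per terminal node) map fairly directly onto scikit-learn parameters. A hedged analogue of the paper's R setup, with the caveat that rpart/randomForest splitting details differ from scikit-learn's:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# synthetic 5-feature data standing in for the paper's X1, ..., X5
rng = np.random.default_rng(0)
X = rng.uniform(size=(500, 5))
y = 2 * X[:, 0] + X[:, 1] ** 2 + 0.1 * rng.normal(size=500)

tree = DecisionTreeRegressor(
    min_samples_split=3,  # need >= 3 observations in a node to consider a split
    max_features=3,       # 3 of the 5 features sampled as split candidates
    min_samples_leaf=2,   # >= 2 observations in each terminal node
    random_state=0,
).fit(X, y)

print(tree.predict(X[:1]))
```

Setting ntree = 1 in randomForest, as the paper does, yields one randomized tree per call; the subsampled ensemble is then assembled by repeating this fit over independent subsamples.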