Scalable and Efficient Hypothesis Testing with Random Forests
Authors: Tim Coleman, Wei Peng, Lucas Mentch
JMLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Simulations and applications to ecological data, where random forests have recently shown promise, are provided. ... In Section 4, we present simulation studies of the testing procedure for a variety of underlying regression functions, as well as a comparison with two different knockoff statistics. In Section 5, we apply our procedure to multiple ecological datasets where random forests have been successfully employed in recent applied work. |
| Researcher Affiliation | Academia | Tim Coleman (EMAIL), Wei Peng (EMAIL), Lucas Mentch (EMAIL), Department of Statistics, University of Pittsburgh, Pittsburgh, PA 15215, USA |
| Pseudocode | Yes | Algorithm 1: Permutation test pseudocode for variable importance |
| Open Source Code | No | The paper mentions using the "randomForest package in R (Liaw and Wiener, 2002)" and the "ranger package (Wright and Ziegler, 2015)", but these are third-party tools. There is no explicit statement or link indicating that the authors' own implementation code for the methodology described in the paper is made publicly available. |
| Open Datasets | Yes | Model 4 where the true data generating model is a random forest. We utilize a dataset from Coleman et al. (2017) ... Fish Toxicity We simulate X from the UCI fish toxicity data set provided by Cassotti et al. (2015) ... Forest Fires: Cortez and Morais (2007) sought to predict log(1+area) burned by several fires in northern Portugal using covariate information on location, time of year, and local weather characteristics. |
| Dataset Splits | Yes | For each of our simulations, we train random forests using the randomForest package in R (Liaw and Wiener, 2002) using the default mtry parameters. ... In both settings, we draw n = 2000 points from the joint distribution of (X, Y), use subsample sizes of kn = n^0.6 ≈ 95, and build B = 125 trees in each forest. Predictions were made at Nt = 100 test points... For our procedure, we build 125 trees and hold out 90 observations at random for testing... Here we select 15% of the available observations (≈3,800 points) uniformly at random to serve as the test set where the hypotheses will be evaluated. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, processor types, memory amounts) used for running its experiments. It only discusses software, datasets, and experimental setup parameters. |
| Software Dependencies | No | We train random forests using the randomForest package in R (Liaw and Wiener, 2002) using the default mtry parameters. ... The random forests were trained with the ranger package using the default mtry = 4... The paper mentions specific software packages (the randomForest and ranger packages in R) but does not provide version numbers for these packages or for R itself. |
| Experiment Setup | Yes | For each of our simulations, we train random forests using the randomForest package in R (Liaw and Wiener, 2002) using the default mtry parameters. ... subsample sizes of kn = n^0.6 ≈ 95, and build B = 125 trees in each forest. Predictions were made at Nt = 100 test points... For Models 1 and 2, we focus on a marginal signal-to-noise ratio, which is controlled by the parameters β and σ. We fix β = 10 across all simulations and let σ = 10/j, where j takes 9 equally spaced values between 0.005 and 2.25... For Model 3, we let kn = n^0.6 ≈ 46, B = 125, Nt = 100, and vary the β coefficient over 8 equally spaced values between 0.01 and 2.5 and over 7 equally spaced values between 5 and 20. In Model 4, we let n = 2000, kn = n^0.6, B = 125, Nt = 100, and let σ = e^j for 10 values of j equally spaced between 1 and 5. ... The random forests were trained with the ranger package using the default mtry = 4, subsamples of size kn = n^0.6, and B = 250 trees in each. ... using mtry = 12 and kn = n^0.6 ≈ 43, with B = 250 trees for the importance test and B = 500 trees for the overall test |
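The quoted setup (n = 2000 training points, subsamples of size kn = n^0.6 ≈ 95, B = 125 trees, predictions at Nt = 100 test points) can be sketched as follows. This is a minimal illustration, not the authors' R code or their actual test statistic (the paper builds a permutation test comparing forests trained on original and permuted data): it uses scikit-learn, which draws subsamples with replacement via `max_samples` rather than subsampling without replacement, and the simulated linear model, the null feature index, and the permutation-importance contrast at the test points are all illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Simulation-scale settings quoted in the table.
n, d = 2000, 5                # training size, number of features
k_n = int(n ** 0.6)           # subsample size, n^0.6 ≈ 95
B = 125                       # trees per forest
N_t = 100                     # held-out test points

# Toy linear signal; the last feature is null (zero coefficient).
X = rng.normal(size=(n + N_t, d))
beta = np.array([10.0, 10.0, 10.0, 10.0, 0.0])
y = X @ beta + rng.normal(scale=1.0, size=n + N_t)

X_tr, y_tr = X[:n], y[:n]
X_te, y_te = X[n:], y[n:]

# Forest of B trees, each grown on max_samples = k_n observations
# (bootstrap draws, unlike the paper's without-replacement subsamples).
rf = RandomForestRegressor(n_estimators=B, max_samples=k_n,
                           random_state=0).fit(X_tr, y_tr)

# Permutation-style importance of the null feature at the test points:
# test MSE with the feature intact vs. with its test values shuffled.
base_mse = np.mean((rf.predict(X_te) - y_te) ** 2)
X_perm = X_te.copy()
X_perm[:, 4] = rng.permutation(X_perm[:, 4])
perm_mse = np.mean((rf.predict(X_perm) - y_te) ** 2)
print(round(perm_mse - base_mse, 3))
```

Because feature 4 carries no signal, the MSE change from permuting it should be near zero, while permuting any of the β = 10 features would inflate test error substantially.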