Scalable and Efficient Hypothesis Testing with Random Forests

Authors: Tim Coleman, Wei Peng, Lucas Mentch

JMLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Simulations and applications to ecological data, where random forests have recently shown promise, are provided. ... In Section 4, we present simulation studies of the testing procedure for a variety of underlying regression functions, as well as a comparison with two different knockoff statistics. In Section 5, we apply our procedure to multiple ecological datasets where random forests have been successfully employed in recent applied work.
Researcher Affiliation | Academia | Tim Coleman EMAIL, Wei Peng EMAIL, Lucas Mentch EMAIL, Department of Statistics, University of Pittsburgh, Pittsburgh, PA 15215, USA
Pseudocode | Yes | Algorithm 1: Permutation test pseudocode for variable importance
Open Source Code | No | The paper mentions using the randomForest package in R (Liaw and Wiener, 2002) and the ranger package (Wright and Ziegler, 2015), but these are third-party tools. There is no explicit statement or link indicating that the authors' own implementation of the methodology described in the paper is publicly available.
Open Datasets | Yes | Model 4, where the true data-generating model is a random forest: we utilize a dataset from Coleman et al. (2017). ... Fish Toxicity: we simulate X from the UCI fish toxicity data set provided by Cassotti et al. (2015). ... Forest Fires: Cortez and Morais (2007) sought to predict log(1 + area) burned by several fires in northern Portugal using covariate information on location, time of year, and local weather characteristics.
Dataset Splits | Yes | For each of our simulations, we train random forests using the randomForest package in R (Liaw and Wiener, 2002) with the default mtry parameters. ... In both settings, we draw n = 2000 points from the joint distribution of (X, Y), use subsamples of size k_n = n^0.6 ≈ 95, and build B = 125 trees in each forest. Predictions were made at N_t = 100 test points. ... For our procedure, we build 125 trees and hold out 90 observations at random for testing. ... Here we select 15% of the available observations (≈ 3800 points) uniformly at random to serve as the test set where the hypotheses will be evaluated.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, processor types, memory amounts) used for running its experiments. It only discusses software, datasets, and experimental setup parameters.
Software Dependencies | No | We train random forests using the randomForest package in R (Liaw and Wiener, 2002) using the default mtry parameters. ... The random forests were trained with the ranger package using the default mtry = 4. ... The paper mentions specific software packages (the randomForest and ranger packages in R) but does not provide version numbers for these packages or for R itself.
Experiment Setup | Yes | For each of our simulations, we train random forests using the randomForest package in R (Liaw and Wiener, 2002) with the default mtry parameters. ... subsamples of size k_n = n^0.6 ≈ 95, and build B = 125 trees in each forest. Predictions were made at N_t = 100 test points. ... For Models 1 and 2, we focus on a marginal signal-to-noise ratio, which is controlled by the parameters β and σ. We fix β = 10 across all simulations and let σ = 10/j, where j takes 9 equally spaced values between 0.005 and 2.25. ... For Model 3, we let k_n = n^0.6 ≈ 46, B = 125, N_t = 100, and vary the β coefficient over 8 equally spaced values between 0.01 and 2.5 and also over 7 equally spaced values between 5 and 20. In Model 4, we let n = 2000, k_n = n^0.6, B = 125, N_t = 100, and let σ = e^j for 10 values of j equally spaced between 1 and 5. ... The random forests were trained with the ranger package using the default mtry = 4, subsamples of size k_n = n^0.6, and B = 250 trees in each. ... using mtry = 12, k_n = n^0.6 ≈ 43, B = 250 trees for the importance test, and B = 500 trees for the overall test.
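For intuition about what the table's "Pseudocode" and "Experiment Setup" rows refer to, the permutation test for variable importance (Algorithm 1, with the reported setup of subsamples of size k_n = n^0.6, B = 125 trees, and N_t = 100 test points) can be sketched roughly as follows. This is a simplified illustration in Python with scikit-learn rather than the authors' R implementation; the function names, the simulated data, and the tree-swap construction of the null distribution are illustrative assumptions, not the paper's exact code.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Simulated data loosely following the reported setup: n = 2000 draws,
# one signal covariate (X_0) and several noise covariates.
n, p = 2000, 5
X = rng.normal(size=(n, p))
y = 10.0 * X[:, 0] + rng.normal(size=n)

n_test = 100                                  # N_t = 100 test points
X_tr, y_tr = X[:-n_test], y[:-n_test]
X_te, y_te = X[-n_test:], y[-n_test:]

k_n = int(round(len(X_tr) ** 0.6))            # subsample size k_n = n^0.6
B = 125                                       # trees per forest

def tree_preds(X_train, y_train):
    """Fit a subsampled forest; return per-tree predictions at test points."""
    rf = RandomForestRegressor(n_estimators=B, max_samples=k_n,
                               random_state=1).fit(X_train, y_train)
    return np.stack([t.predict(X_te) for t in rf.estimators_])

def ensemble_mse(preds):
    """Test-set MSE of the ensemble (mean over trees) prediction."""
    return float(np.mean((preds.mean(axis=0) - y_te) ** 2))

def importance_test(j, n_null=200):
    """Does permuting covariate j degrade held-out accuracy significantly?"""
    X_perm = X_tr.copy()
    X_perm[:, j] = rng.permutation(X_perm[:, j])
    P0 = tree_preds(X_tr, y_tr)               # forest on original data
    P1 = tree_preds(X_perm, y_tr)             # forest with X_j permuted
    obs = ensemble_mse(P1) - ensemble_mse(P0)
    # Null distribution: randomly swap trees between the two forests and
    # recompute the difference in test error.
    pooled = np.vstack([P0, P1])
    null = np.empty(n_null)
    for b in range(n_null):
        idx = rng.permutation(2 * B)
        null[b] = ensemble_mse(pooled[idx[:B]]) - ensemble_mse(pooled[idx[B:]])
    return (1 + np.sum(null >= obs)) / (1 + n_null)   # one-sided p-value

p_signal = importance_test(0)                 # signal variable: small p expected
p_noise = importance_test(4)                  # noise variable: large p expected
print("p-value for X_0:", p_signal)
print("p-value for X_4:", p_noise)
```

The tree-swapping step is meant to mimic the spirit of building a reference distribution by exchanging trees between the two forests; the expected pattern is a small p-value for the signal variable and a non-small one for a noise variable.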