Confidence and Uncertainty Assessment for Distributional Random Forests

Authors: Jeffrey Näf, Corinne Emmenegger, Peter Bühlmann, Nicolai Meinshausen

JMLR 2023

Reproducibility variables, assessed results, and supporting LLM responses:
Research Type: Experimental — "In simulation studies, we empirically validate the developed theory for inference of low-dimensional targets and for testing distributional differences between two populations." ... "In this section, we demonstrate the performance of our DRF confidence intervals for the CATE, conditional quantiles, conditional correlations, and conditional witness functions for simulated data."
Researcher Affiliation: Academia — Jeffrey Näf (Inria, PreMeDICaL Team, University of Montpellier, 34000 Montpellier, France); Corinne Emmenegger (Seminar for Statistics, ETH Zurich, 8092 Zurich, Switzerland); Peter Bühlmann (Seminar for Statistics, ETH Zurich, 8092 Zurich, Switzerland); Nicolai Meinshausen (Seminar for Statistics, ETH Zurich, 8092 Zurich, Switzerland)
Pseudocode: Yes — "Algorithm 1: Pseudocode for Distributional Random Forest with Uncertainty. The functions BuildForest and GetWeights are defined in Algorithm 2 in Appendix C." ... "Algorithm 2: Pseudocode for Distributional Random Forest in Ćevid et al. (2022)"
Open Source Code: Yes — "Code of our analysis is available on GitHub (https://github.com/JeffNaef/drfinference)."
Open Datasets: No — "We consider almost exclusively data generating mechanisms that have already been considered by Ćevid et al. (2022). The only adaptation is that we consider U(−1, 1)^p distributed covariates X instead of U(0, 1)^p in Section 6.3." ... "We simulate data from X ∼ Unif(0, 1)^5, W | X ∼ Bernoulli(0.25(1 + β_{2,4}(X_3))), Y | (X, W) ∼ 2(X_3 − 0.5) + N(0, 1)" ... "X ∼ Unif(−1, 1)^5, Y ∼ N(0.8 · 1{X_1 > 0}, 1)". The paper describes data generating mechanisms for simulated data but does not provide access information for any publicly available or open datasets.
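The two quoted data generating mechanisms are simple enough to sketch directly. A minimal simulation in Python, reproducing only the terms stated in the excerpt; reading β_{2,4} as the Beta(2, 4) density is an assumption, since the excerpt does not define the symbol:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 2000, 5

# CATE example as quoted: X ~ Unif(0, 1)^5,
# W | X ~ Bernoulli(0.25 * (1 + beta_{2,4}(X_3))).
# beta_{2,4} is read here as the Beta(2, 4) density, f(x) = 20 * x * (1 - x)^3
# (an assumption; the excerpt does not define the symbol).
X = rng.uniform(0.0, 1.0, size=(n, p))
beta_24 = 20.0 * X[:, 2] * (1.0 - X[:, 2]) ** 3
propensity = 0.25 * (1.0 + beta_24)  # stays within [0.25, ~0.78] for Beta(2, 4)
W = rng.binomial(1, propensity)
# Outcome model, reproducing only the terms stated in the excerpt:
Y = 2.0 * (X[:, 2] - 0.5) + rng.standard_normal(n)

# Distributional-difference example: X ~ Unif(-1, 1)^5, Y ~ N(0.8 * 1{X_1 > 0}, 1)
X2 = rng.uniform(-1.0, 1.0, size=(n, p))
Y2 = 0.8 * (X2[:, 0] > 0).astype(float) + rng.standard_normal(n)
```

Since the data are purely synthetic, anyone can regenerate them from these formulas, which is presumably why no dataset download is provided.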
Dataset Splits: No — "First, we require that the data used to build a tree is independent from the data used to populate its leaves for prediction. To ensure this, we split the subsample used to build a particular tree into two halves. The first half is used to construct the tree. Then, the data from the second half gets assigned to the leaves of the tree according to the covariate splits that were fitted on the first half." The paper describes internal splitting within the algorithm (the honesty scheme of halving each tree's subsample), but it does not specify overall train/validation/test splits for the simulated datasets used in the empirical evaluation.
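The honesty scheme in the quoted passage — one half of a tree's subsample for fitting the splits, the other half for populating the leaves — can be sketched as follows; `honest_split` is a hypothetical helper name for illustration, not taken from the paper's code:

```python
import numpy as np

def honest_split(indices, rng):
    """Split a tree's subsample into a structure half and an estimation half,
    so that leaf contents are computed on data independent of the fitted splits.
    Sketch of the honesty scheme described above, not the authors' implementation."""
    indices = rng.permutation(indices)
    half = len(indices) // 2
    # First half: used to construct the tree (choose covariate splits).
    # Second half: routed down the fitted tree to populate its leaves.
    return indices[:half], indices[half:]

rng = np.random.default_rng(1)
build_idx, leaf_idx = honest_split(np.arange(10), rng)
```

This is a per-tree split inside the forest-building algorithm, which is why it does not stand in for a conventional train/test split of the overall dataset.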
Hardware Specification: No — "With DRF, we used a total number of 10^5 trees whereas with GRF, we were not able to use as many due to computational reasons." The paper mentions computational considerations but does not provide specific details about the hardware (e.g., CPU or GPU models, memory, or cloud resources) used for the experiments.
Software Dependencies: No — "Since the drf package (Michel and Ćevid, 2021) used is based on grf (Tibshirani et al., 2022), this indicates empirically that the target-tailored splitting criterion of GRF can be computationally considerably more expensive than the general splitting criterion of DRF." The paper names the 'drf' and 'grf' packages with publication years, but it does not state their version numbers in the main text where they are referenced; version numbers appear only in the bibliography.
Experiment Setup: Yes — "In all examples except for the conditional witness functions, we grow a forest that consists of B = 100 subforests with ℓ = 1000 trees each, and we choose β = 0.9 in assumption (F5). To fit trees, 10 random features are used for the approximation of the MMD statistic when splitting the nodes, and the minimal node size is 5. Moreover, we consider the Gaussian kernel with the median bandwidth heuristic and compute confidence intervals using the Gaussian approximation. For the conditional witness functions, we consider forests that consist of B = 200 subforests with ℓ = 1000 trees each and choose β = 0.9 because estimating whole confidence bands for the conditional witness function is a complicated task."
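One concrete ingredient of this setup, the median bandwidth heuristic for the Gaussian kernel, can be sketched as below. The excerpt does not pin down the exact convention (e.g., median of distances vs. of squared distances), so this is a minimal sketch using the common median-of-pairwise-distances variant:

```python
import numpy as np

def median_heuristic_bandwidth(Y):
    """Median heuristic: bandwidth set to the median pairwise Euclidean
    distance among responses. One common convention; the paper's exact
    variant is not specified in the excerpt."""
    Y = np.asarray(Y, dtype=float)
    if Y.ndim == 1:
        Y = Y[:, None]
    diffs = Y[:, None, :] - Y[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    iu = np.triu_indices(len(Y), k=1)  # upper triangle: each pair once
    return float(np.median(dists[iu]))

def gaussian_kernel(y1, y2, sigma):
    """Gaussian (RBF) kernel with bandwidth sigma."""
    y1, y2 = np.asarray(y1, dtype=float), np.asarray(y2, dtype=float)
    return float(np.exp(-np.sum((y1 - y2) ** 2) / (2.0 * sigma ** 2)))
```

With the bandwidth fixed this way from the data, the kernel feeds both the MMD splitting criterion and the downstream confidence-interval construction described above.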