Tree-Values: Selective Inference for Regression Trees
Authors: Anna C. Neufeld, Lucy L. Gao, Daniela M. Witten
JMLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We apply these methods in simulation and to a dataset involving the association between portion control interventions and caloric intake. |
| Researcher Affiliation | Academia | Anna C. Neufeld EMAIL Department of Statistics University of Washington Seattle, WA 98195, USA Lucy L. Gao EMAIL Department of Statistics University of British Columbia Vancouver, British Columbia, V6T 1Z4, Canada Daniela M. Witten EMAIL Departments of Statistics and Biostatistics University of Washington Seattle, WA 98195, USA |
| Pseudocode | Yes | Algorithm A1 (Growing a tree): Grow(R, y). 1. If a stopping condition is met, return R. 2. Else return {R, Grow(R ∩ χ_{j\*,s\*,1}, y), Grow(R ∩ χ_{j\*,s\*,0}, y)}, where (j\*, s\*) = argmax over (j, s) with s ∈ {1, ..., n−1} and j ∈ {1, ..., p} of gain_R(y, j, s). Algorithm A2 (Cost-complexity pruning): Prune(tree, y, λ, O), where the parameter O is a bottom-up ordering of the K regions in tree. 1. Let tree_0 = tree, and let K be the number of regions in tree_0. 2. For k = 1, ..., K: (a) let R be the kth region in O; (b) set tree_k = tree_{k−1} \ desc(R, tree_{k−1}) if g(R, tree_{k−1}, y) < λ, and tree_k = tree_{k−1} otherwise, where g(·) is defined in (3). 3. Return tree_K. |
| Open Source Code | Yes | A software implementation of the methods in this paper is available in the R package treevalues, at https://github.com/anna-neufeld/treevalues. |
| Open Datasets | Yes | We revisit their analysis of the Box Lunch Study, a clinical trial studying the impact of portion control interventions on 24-hour caloric intake. We consider identifying subgroups of study participants with baseline differences in 24-hour caloric intake on the basis of scores from an assessment that quantifies constructs such as hunger, liking, the relative reinforcement of food (rrvfood), and restraint (resteating). We exactly reproduce the trees presented in Figures 1 and 2 of Venkatasubramaniam et al. (2017) by building a CTree using partykit and a CART tree using rpart on the Box Lunch Study data provided in the R package visTree (Venkatasubramaniam and Wolfson, 2018). |
| Dataset Splits | Yes | Sample splitting: Split the data into equally-sized training and test sets. Fit a CART tree to the training set. On the test set, conduct a naive Z-test for each split and compute a naive Z-interval for each split and each region. |
| Hardware Specification | No | The paper describes experimental setups, including simulation studies and an application to a real dataset, and mentions software packages used. However, it does not provide specific hardware details such as CPU or GPU models, memory specifications, or server configurations used for running the experiments. |
| Software Dependencies | Yes | All CART trees are fit using the R package rpart (Therneau and Atkinson, 2019) with λ = 200, a maximum level of three, and a minimum node size of one. ... Fit a CTree to all of the data using the R package partykit (Hothorn and Zeileis, 2015) with α = 0.05. ... on the Box Lunch Study data provided in the R package visTree (Venkatasubramaniam and Wolfson, 2018). |
| Experiment Setup | Yes | All CART trees are fit using the R package rpart (Therneau and Atkinson, 2019) with λ = 200, a maximum level of three, and a minimum node size of one. We compare three approaches for conducting inference. (i) Selective Z-methods: Fit a CART tree to the data. For each split, test for a difference in means between the two sibling regions using (8), and compute the corresponding confidence interval in (13). Compute the confidence interval for the mean of each region using (23). (ii) Naive Z-methods: Fit a CART tree to the data. For each split, conduct a naive Z-test for the difference in means between the two sibling regions, and compute the corresponding naive Z-interval. Compute a naive Z-interval for each region's mean. (iii) Sample splitting: Split the data into equally-sized training and test sets. Fit a CART tree to the training set. On the test set, conduct a naive Z-test for each split and compute a naive Z-interval for each split and each region. If a region has no test set observations, then we fail to reject the null hypothesis and fail to cover the parameter. ... (iv) CTree: Fit a CTree to all of the data using the R package partykit (Hothorn and Zeileis, 2015) with α = 0.05. |
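The growing and pruning routines quoted in the Pseudocode row can be sketched as follows. This is a hedged illustration only: the paper's actual implementation is the R package treevalues, the gain used here is the standard squared-error reduction (standing in for gain_R and the g(·) of the paper's equation (3)), and all function and variable names are invented for the sketch.

```python
import numpy as np

def gain(X, y, region, j, s):
    """Squared-error reduction from splitting `region` on feature j at threshold s."""
    left = region[X[region, j] <= s]
    right = region[X[region, j] > s]
    if len(left) == 0 or len(right) == 0:
        return -np.inf
    sse = lambda idx: np.sum((y[idx] - y[idx].mean()) ** 2)
    return sse(region) - sse(left) - sse(right)

def grow(X, y, region, min_size=2):
    """Algorithm A1 sketch: recursively split at the gain-maximizing (j, s)."""
    best, best_gain = None, 0.0
    if len(region) >= min_size:
        for j in range(X.shape[1]):
            for s in np.unique(X[region, j])[:-1]:   # candidate thresholds
                g = gain(X, y, region, j, s)
                if g > best_gain:
                    best, best_gain = (j, s), g
    if best is None:   # stopping condition: no gain-improving split exists
        return {"region": region, "children": None}
    j, s = best
    left = region[X[region, j] <= s]
    right = region[X[region, j] > s]
    return {"region": region, "split": best,
            "children": (grow(X, y, left, min_size), grow(X, y, right, min_size))}

def prune(node, X, y, lam):
    """Algorithm A2 sketch: bottom-up pass collapsing splits whose gain < lambda."""
    if node["children"] is None:
        return node
    l = prune(node["children"][0], X, y, lam)
    r = prune(node["children"][1], X, y, lam)
    node = dict(node, children=(l, r))
    if l["children"] is None and r["children"] is None:
        sse = lambda idx: np.sum((y[idx] - y[idx].mean()) ** 2)
        g = sse(node["region"]) - sse(l["region"]) - sse(r["region"])
        if g < lam:   # collapse this split and its (leaf) descendants
            return {"region": node["region"], "children": None}
    return node
```

For example, on a toy response with a single clear change point, `grow` recovers the split and `prune` removes it when λ exceeds its gain; the bottom-up ordering O of Algorithm A2 corresponds here to the post-order recursion in `prune`.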