J Mach Learn Res. 2022;23:305.

Tree-Values: Selective Inference for Regression Trees

Anna C Neufeld¹, Lucy L Gao², Daniela M Witten³

Affiliations

¹ Department of Statistics, University of Washington, Seattle, WA 98195, USA.
² Department of Statistics, University of British Columbia, Vancouver, British Columbia, V6T 1Z4, Canada.
³ Departments of Statistics and Biostatistics, University of Washington, Seattle, WA 98195, USA.

PMID: 38481523
PMCID: PMC10933572

Anna C Neufeld et al. J Mach Learn Res. 2022.

Abstract

We consider conducting inference on the output of the Classification and Regression Tree (CART) (Breiman et al., 1984) algorithm. A naive approach to inference that does not account for the fact that the tree was estimated from the data will not achieve standard guarantees, such as Type 1 error rate control and nominal coverage. Thus, we propose a selective inference framework for conducting inference on a fitted CART tree. In a nutshell, we condition on the fact that the tree was estimated from the data. We propose a test for the difference in the mean response between a pair of terminal nodes that controls the selective Type 1 error rate, and a confidence interval for the mean response within a single terminal node that attains the nominal selective coverage. Efficient algorithms for computing the necessary conditioning sets are provided. We apply these methods in simulation and to a dataset involving the association between portion control interventions and caloric intake.
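The conditioning idea can be made concrete with a small sketch. It assumes, beyond what the abstract states, that under the null hypothesis the test statistic $ν^{T} Y$ is a mean-zero normal with standard deviation `sd`, that the conditioning set is a union of intervals (such as the one reported in the Figure 2 caption), and that the test is two-sided. The function names, the observed value, and `sd` are all hypothetical; this is an illustration of a truncated-normal p-value, not the authors' implementation.

```python
import math


def norm_cdf(x, sd=1.0):
    """CDF of N(0, sd^2); also valid for x = +/- infinity."""
    return 0.5 * math.erfc(-x / (sd * math.sqrt(2.0)))


def mass(lo, hi, sd):
    """Probability that N(0, sd^2) lies in (lo, hi); zero if the interval is empty."""
    return max(0.0, norm_cdf(hi, sd) - norm_cdf(lo, sd))


def selective_p_value(observed, intervals, sd):
    """Two-sided p-value for a N(0, sd^2) statistic truncated to a union of intervals."""
    t = abs(observed)
    # Probability of the conditioning set itself.
    denom = sum(mass(lo, hi, sd) for lo, hi in intervals)
    # Probability of the part of the set at least as extreme as the observed value.
    numer = sum(
        mass(lo, min(hi, -t), sd) + mass(max(lo, t), hi, sd)
        for lo, hi in intervals
    )
    return numer / denom


# Conditioning set taken from the Figure 2 caption.
S_FIG2 = [(-19.8, -1.8), (0.9, 34.9)]
OBSERVED = -14.9  # hypothetical observed statistic
SD = 5.0          # hypothetical value of sigma * ||nu||

p_selective = selective_p_value(OBSERVED, S_FIG2, SD)
# With no truncation, the same function gives the usual (naive) two-sided Z-test.
p_naive = selective_p_value(OBSERVED, [(-math.inf, math.inf)], SD)
print(p_selective, p_naive)
```

Passing the whole real line as the conditioning set recovers the naive $Z$-test, which makes the effect of conditioning easy to compare on the same inputs.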

Keywords: CART; Regression trees; hypothesis testing; post-selection inference; selective inference.

Figures

**Figure 1:**
The regression tree takes the form $\mathrm{TREE} = \{ℝ^{p}, χ_{1,s_{1},1}, χ_{1,s_{1},0}, χ_{1,s_{1},1} \cap χ_{2,s_{2},1}, χ_{1,s_{1},1} \cap χ_{2,s_{2},0}\}$. The regions $R_{A} = χ_{1,s_{1},1} \cap χ_{2,s_{2},1}$ and $R_{B} = χ_{1,s_{1},1} \cap χ_{2,s_{2},0}$ are siblings, and are children, and therefore descendants, of the region $χ_{1,s_{1},1}$. The ancestors of $R_{A}$ and $R_{B}$ are $ℝ^{p}$ and $χ_{1,s_{1},1}$. Furthermore, $R_{A}$, $R_{B}$, and $χ_{1,s_{1},0}$ are terminal regions.
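The relationships named in this caption (siblings, ancestors, terminal regions) can be sketched in a few lines. The representation is hypothetical: a region is a tuple of half-space constraints `(j, s, e)`, and the sketch assumes that `e = 1` means "feature `j` is at most `s`" and `e = 0` means "feature `j` exceeds `s`", with `s` treated as a numeric threshold. The caption itself does not spell out these conventions, and the split values below are made up.

```python
# A region is a tuple of constraints (j, s, e); the empty tuple is all of R^p.
def contains(region, x):
    """True if point x satisfies every half-space constraint of the region."""
    return all((x[j] <= s) == bool(e) for (j, s, e) in region)


def children(region, j, s):
    """Split a region on feature j at threshold s into its two children."""
    return region + ((j, s, 1),), region + ((j, s, 0),)


def is_ancestor(a, b):
    """True if region a is a proper ancestor of region b."""
    return len(a) < len(b) and b[:len(a)] == a


def are_siblings(a, b):
    """True if a and b are the two children of the same split."""
    return len(a) == len(b) > 0 and a != b and a[:-1] == b[:-1] and a[-1][:2] == b[-1][:2]


def terminal_regions(tree):
    """Regions of the tree that have no descendants in the tree."""
    return [r for r in tree if not any(is_ancestor(r, q) for q in tree)]


# The tree of Figure 1, with hypothetical split values s1 = 0.5 and s2 = 0.3.
root = ()
left, right = children(root, 0, 0.5)   # chi_{1,s1,1} and chi_{1,s1,0}
r_a, r_b = children(left, 1, 0.3)      # R_A and R_B
tree = [root, left, right, r_a, r_b]

print(terminal_regions(tree))
print([r for r in tree if is_ancestor(r, r_a)])
```

Storing each region as the path of splits that produced it makes the ancestor relation a simple prefix check.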

**Figure 2:**
Data with $n = 100$ and $p = 2$. Regions resulting from CART $(λ = 0)$ are delineated using solid lines. Here, $R_{A} = χ_{1,26,0} \cap χ_{2,72,1}$ and $R_{B} = χ_{1,26,0} \cap χ_{2,72,0}$. *Top:* Output of CART applied to $y'(ϕ, ν_{sib})$, where $ν_{sib}$ in (6) encodes the contrast between $R_{A}$ and $R_{B}$, for various values of $ϕ$. The left-most panel displays $y = y'(ν_{sib}^{T} y, ν_{sib})$. By inspection, we see that $-14.9 \in S_{sib}^{0}(ν_{sib})$ and $5 \in S_{sib}^{0}(ν_{sib})$, but $0 \notin S_{sib}^{0}(ν_{sib})$ and $40 \notin S_{sib}^{0}(ν_{sib})$. In fact, $S_{sib}^{0}(ν_{sib}) = (-19.8, -1.8) \cup (0.9, 34.9)$. *Bottom:* Output of CART applied to $y'(ϕ, ν_{reg})$, where $ν_{reg}$ in (14) encodes membership in $R_{A}$. The left-most panel displays $y = y'(ν_{reg}^{T} y, ν_{reg})$. Here, $S_{reg}^{0}(ν_{reg}) = (-\infty, 3.1) \cup (5.8, 8.8) \cup (14.1, \infty)$.
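The perturbed response in this caption can be illustrated with a short sketch. The caption does not give the formulas, so the code assumes the common construction in which $y'(ϕ, ν)$ moves $y$ only along the direction $ν$ so that $ν^{T} y'(ϕ, ν) = ϕ$, that $ν_{sib}$ averages the response over $R_{A}$ and subtracts the average over $R_{B}$, and that $ν_{reg}$ averages over $R_{A}$. All function names and the toy data are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def nu_sib(in_a, in_b):
    """Contrast vector: nu^T y = mean of y in R_A minus mean of y in R_B (assumed form)."""
    n_a, n_b = sum(in_a), sum(in_b)
    return [a / n_a - b / n_b for a, b in zip(in_a, in_b)]


def nu_reg(in_a):
    """Membership vector: nu^T y = mean of y in R_A (assumed form)."""
    n_a = sum(in_a)
    return [a / n_a for a in in_a]


def perturb(y, nu, phi):
    """Shift y along nu so that nu^T y' equals phi, leaving other directions unchanged."""
    shift = (dot(nu, y) - phi) / dot(nu, nu)
    return [yi - shift * ni for yi, ni in zip(y, nu)]


# Toy data: observations 0-1 fall in R_A, 2-3 in R_B, and 4 in neither region.
y = [1.0, 2.0, 4.0, 6.0, 3.0]
in_a = [1, 1, 0, 0, 0]
in_b = [0, 0, 1, 1, 0]

nu = nu_sib(in_a, in_b)
y_null = perturb(y, nu, 0.0)  # the two regions now have equal sample means
print(dot(nu, y), y_null)
```

In this framing, the sets $S_{sib}^{0}$ and $S_{reg}^{0}$ collect the values of $ϕ$ for which rerunning CART on the perturbed response still produces the regions of interest; that search is not implemented here.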

**Figure 3:**
The true mean model in Section 5, for $a = 0.5$ (left), $a = 1$ (center), and $a = 2$ (right). The difference in means between the sibling nodes at level two in the tree is $ab$, while the difference in means between the sibling nodes at level three is $b$.

**Figure 4:**
Quantile-quantile plots of the p-values for testing $H_{0} : ν_{sib}^{T} μ = 0$, as described in Section 5.3. A naive $Z$-test (green), sample splitting (blue), and selective $Z$-test (pink) were performed; see Section 5.2. The p-values are stratified by the level of the regions in the fitted tree.

**Figure 5:**
Proportion of true splits detected (solid lines) and rejected (dotted lines) for CART with selective $Z$-tests (pink), CTree (black), and CART with sample splitting (blue) across different settings of the data generating mechanism, stratified by level in the tree. As CTree only makes a split if the p-value is less than 0.05, the proportion of detections equals the proportion of rejections.

**Figure 6:**
The median width of the selective $Z$-intervals for parameter $ν_{reg}^{T} μ$ for regions at levels one (solid), two (dashed), and three (dotted) of the tree. Similar results hold for parameter $ν_{sib}^{T} μ$. Panel (a) breaks results down by the parameters $a$ and $b$, whereas panel (b) aggregates results across values of parameters $a$ and $b$, and displays them as a function of the adjusted Rand index between the true and estimated trees.
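For readers unfamiliar with the adjusted Rand index, the sketch below computes it for two partitions of the same observations, for example the terminal-region labels assigned by a true tree and by an estimated tree. This is the standard pair-counting formula, offered as an illustration; the caption does not say how the index was computed for the figure, and the function name and labels are hypothetical. The sketch assumes at least two observations.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two labelings of the same n >= 2 observations."""
    n = len(labels_a)
    # Pairs of observations grouped together in both partitions.
    pairs_joint = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs grouped together within each partition separately.
    pairs_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    pairs_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = pairs_a * pairs_b / comb(n, 2)
    max_index = 0.5 * (pairs_a + pairs_b)
    if max_index == expected:
        # Degenerate case (e.g. both partitions trivial): treat as perfect agreement.
        return 1.0
    return (pairs_joint - expected) / (max_index - expected)


# Hypothetical terminal-region labels for six observations.
true_regions = ["A", "A", "B", "B", "C", "C"]
estimated_regions = ["x", "x", "y", "y", "y", "z"]
print(adjusted_rand_index(true_regions, estimated_regions))
```

The index equals one when the two partitions agree up to relabeling and is near zero when they agree no more than expected by chance.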

**Figure 7:**
QQ plots of the p-values from testing $H_{0} : ν_{sib}^{T} μ = 0$ when $μ = 0_{n}$ using the selective $Z$-test with three different values plugged in to the truncated normal CDF for $σ$. The p-values are stratified by the level of the regions in the fitted tree.

**Figure 8:**
Proportion of true splits detected (solid lines) and rejected (dotted lines) for CART with the three versions of the selective $Z$-test. The results are stratified by level in the tree.

**Figure 9:**
*Left:* A CART tree fit to the Box Lunch Study data. Each split has been labeled with a p-value (8), and each region has been labeled with a confidence interval (23). The shading of the nodes indicates the average response values (white indicates a very small value and dark blue a very large value). *Top right:* A CTree fit to the Box Lunch Study data. *Bottom right:* A scatterplot showing the relationship between the covariate hunger and the response.

**Figure 10:**
*(a)* An illustration of Case 1 (red), Case 2 (blue), and Case 3 (black) for a region $R \in \mathrm{TREE}^{0}\{y'(ϕ_{1}, ν)\}$ in the base case of the proof of Lemma 27, where $ℛ(ℬ) = \{R^{(0)}, \dots, R^{(3)}\}$. *(b)* The black regions show the possible cases for $R \in \mathrm{TREE}_{k-1}$ in the inductive step of the proof of Lemma 27.

**Figure 11:**
Simulation results comparing inference based on the full conditioning set to inference based on the identity permutation only (see Section 4.3). The left panel shows power curves. The center panel zooms in on one section of the left panel. The right panel shows median widths of confidence intervals.

**Figure 12:**
Quantile-quantile plots of the p-values for testing $H_{0} : ν_{sib}^{T} μ = 0$ under a global null. A naive $Z$-test (green), sample splitting (blue), and selective $Z$-test (pink) were performed; see Section 5.2.

**Figure 13:**
A CART tree fit to the Box Lunch Study data. Each split has been labeled with a p-value (8), and each region has been labeled with a confidence interval (23). Inference is carried out by plugging in $\hat{σ}_{cons}$, from Section 5.7, as an estimate of $σ$.

References

1. Athey Susan and Imbens Guido. Recursive partitioning for heterogeneous causal effects. Proceedings of the National Academy of Sciences, 113(27):7353–7360, 2016.
2. Bhattacharya PK. Some aspects of change-point analysis. Lecture Notes-Monograph Series, pages 28–56, 1994.
3. Bourgon Richard. Overview of the intervals package, 2009. R Vignette, URL https://cran.r-project.org/web/packages/intervals/vignettes/intervals_ov....
4. Breiman Leo, Friedman Jerome, Stone Charles J, and Olshen Richard A. Classification and regression trees. CRC Press, 1984.
5. Chen Shuxiao and Bien Jacob. Valid inference corrected for outlier removal. Journal of Computational and Graphical Statistics, 29(2):323–334, 2020.

Grants and funding

R01 GM123993/GM/NIGMS NIH HHS/United States
