J Mach Learn Res. 2022;23:305.

Tree-Values: Selective Inference for Regression Trees


Anna C Neufeld et al. J Mach Learn Res. 2022.

Abstract

We consider conducting inference on the output of the Classification and Regression Tree (CART) (Breiman et al., 1984) algorithm. A naive approach to inference that does not account for the fact that the tree was estimated from the data will not achieve standard guarantees, such as Type 1 error rate control and nominal coverage. Thus, we propose a selective inference framework for conducting inference on a fitted CART tree. In a nutshell, we condition on the fact that the tree was estimated from the data. We propose a test for the difference in the mean response between a pair of terminal nodes that controls the selective Type 1 error rate, and a confidence interval for the mean response within a single terminal node that attains the nominal selective coverage. Efficient algorithms for computing the necessary conditioning sets are provided. We apply these methods in simulation and to a dataset involving the association between portion control interventions and caloric intake.
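
The claim that naive inference fails to control the Type 1 error rate can be illustrated with a small simulation. The sketch below is not the authors' code: it assumes a single covariate, a single CART-style split chosen to minimize the two-node sum of squared errors, a known noise level sigma = 1, and a global null in which the response is pure noise. The function names `best_split`, `naive_z_pvalue`, and `rejection_rate` are hypothetical.

```python
import math
import random


def best_split(x, y):
    """Split the responses into two nodes at the threshold on x that
    minimizes the total within-node sum of squared errors."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    ys = [y[i] for i in order]
    n = len(ys)
    total, total_sq = sum(ys), sum(v * v for v in ys)
    best_k, best_sse = None, float("inf")
    left_sum = left_sq = 0.0
    for k in range(1, n):  # left node holds the first k sorted points
        left_sum += ys[k - 1]
        left_sq += ys[k - 1] ** 2
        right_sum, right_sq = total - left_sum, total_sq - left_sq
        sse = (left_sq - left_sum ** 2 / k) + (right_sq - right_sum ** 2 / (n - k))
        if sse < best_sse:
            best_k, best_sse = k, sse
    return ys[:best_k], ys[best_k:]


def naive_z_pvalue(left, right, sigma=1.0):
    """Two-sided Z-test for equal node means that ignores how the split was chosen."""
    diff = sum(left) / len(left) - sum(right) / len(right)
    se = sigma * math.sqrt(1.0 / len(left) + 1.0 / len(right))
    return math.erfc(abs(diff / se) / math.sqrt(2.0))


def rejection_rate(n_sims=500, n=50, alpha=0.05, seed=0):
    """Fraction of null data sets in which the naive test rejects at level alpha."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sims):
        x = [rng.random() for _ in range(n)]
        y = [rng.gauss(0.0, 1.0) for _ in range(n)]  # no true signal
        left, right = best_split(x, y)
        if naive_z_pvalue(left, right) < alpha:
            rejections += 1
    return rejections / n_sims


if __name__ == "__main__":
    print("naive rejection rate under the null:", rejection_rate())
```

Because the split is chosen to make the two node means as different as possible, the naive rejection rate in this setting should come out well above the nominal 0.05, which is the failure that conditioning on the fitted tree is meant to correct.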

Keywords: CART; Regression trees; hypothesis testing; post-selection inference; selective inference.


Figures

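Figure 1:
The regression tree takes the form $\text{TREE} = \{\mathbb{R}^p, \chi_{1,s_1,1}, \chi_{1,s_1,0}, \chi_{1,s_1,1} \cap \chi_{2,s_2,1}, \chi_{1,s_1,1} \cap \chi_{2,s_2,0}\}$. The regions $R_A = \chi_{1,s_1,1} \cap \chi_{2,s_2,1}$ and $R_B = \chi_{1,s_1,1} \cap \chi_{2,s_2,0}$ are siblings, and are children, and therefore descendants, of the region $\chi_{1,s_1,1}$. The ancestors of $R_A$ and $R_B$ are $\mathbb{R}^p$ and $\chi_{1,s_1,1}$. Furthermore, $R_A$, $R_B$, and $\chi_{1,s_1,0}$ are terminal regions.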
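Figure 2:
Data with $n = 100$ and $p = 2$. Regions resulting from CART ($\lambda = 0$) are delineated using solid lines. Here, $R_A = \chi_{1,26,0} \cap \chi_{2,72,1}$ and $R_B = \chi_{1,26,0} \cap \chi_{2,72,0}$. Top: Output of CART applied to $y'(\phi, \nu_{\text{sib}})$, where $\nu_{\text{sib}}$ in (6) encodes the contrast between $R_A$ and $R_B$, for various values of $\phi$. The left-most panel displays $y = y'(\nu_{\text{sib}}^\top y, \nu_{\text{sib}})$. By inspection, we see that $-14.9 \in S^{0}_{\text{sib}}(\nu_{\text{sib}})$ and $5 \in S^{0}_{\text{sib}}(\nu_{\text{sib}})$, but $0 \notin S^{0}_{\text{sib}}(\nu_{\text{sib}})$ and $40 \notin S^{0}_{\text{sib}}(\nu_{\text{sib}})$. In fact, $S^{0}_{\text{sib}}(\nu_{\text{sib}}) = (-19.8, -1.8) \cup (0.9, 34.9)$. Bottom: Output of CART applied to $y'(\phi, \nu_{\text{reg}})$, where $\nu_{\text{reg}}$ in (14) encodes membership in $R_A$. The left-most panel displays $y = y'(\nu_{\text{reg}}^\top y, \nu_{\text{reg}})$. Here, $S^{0}_{\text{reg}}(\nu_{\text{reg}}) = (-\infty, 3.1) \cup (5.8, 8.8) \cup (14.1, \infty)$.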
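
Once a conditioning set such as the union of intervals in Figure 2 is known, a selective p-value can be computed from a normal distribution truncated to that set. The following is a minimal sketch under stated assumptions, not the authors' implementation: it takes the test statistic to be mean-zero normal under the null with a known standard deviation `sd`, and the observed value `-14.9` and `sd = 5.0` used in the demo are hypothetical placeholders, since the caption does not report them. The helper names are also hypothetical.

```python
import math


def norm_cdf(x, sd):
    """CDF of a N(0, sd^2) variable; handles +/- infinity."""
    return 0.5 * (1.0 + math.erf(x / (sd * math.sqrt(2.0))))


def prob_in(intervals, sd):
    """Probability that a N(0, sd^2) variable falls in a union of disjoint intervals."""
    return sum(norm_cdf(hi, sd) - norm_cdf(lo, sd) for lo, hi in intervals)


def intersect(intervals, lo, hi):
    """Intersect each interval with (lo, hi), dropping empty pieces."""
    pieces = []
    for a, b in intervals:
        a2, b2 = max(a, lo), min(b, hi)
        if a2 < b2:
            pieces.append((a2, b2))
    return pieces


def contains(intervals, phi):
    """True if phi lies in the (open) union of intervals."""
    return any(lo < phi < hi for lo, hi in intervals)


def selective_p_value(observed, intervals, sd):
    """P(|phi| >= |observed| given phi in the set), for phi ~ N(0, sd^2)."""
    t = abs(observed)
    tail = intersect(intervals, -math.inf, -t) + intersect(intervals, t, math.inf)
    return prob_in(tail, sd) / prob_in(intervals, sd)


# Conditioning set reported in the Figure 2 caption (top row).
S_sib = [(-19.8, -1.8), (0.9, 34.9)]

if __name__ == "__main__":
    for phi in (-14.9, 0, 5, 40):
        print(phi, contains(S_sib, phi))
    # Hypothetical observed statistic and null standard deviation.
    print(selective_p_value(-14.9, S_sib, sd=5.0))
```

When the set is the whole real line, `selective_p_value` reduces to the usual two-sided Z-test p-value. The differences of CDF values used here can lose precision far in the tails, so a production implementation would likely need a more careful numerical treatment.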
Figure 3:
The true mean model in Section 5, for a=0.5 (left), a=1 (center), and a=2 (right). The difference in means between the sibling nodes at level two in the tree is ab, while the difference in means between the sibling nodes at level three is b.
Figure 4:
Quantile-quantile plots of the p-values for testing $H_0: \nu_{\text{sib}}^\top \mu = 0$, as described in Section 5.3. A naive Z-test (green), sample splitting (blue), and selective Z-test (pink) were performed; see Section 5.2. The p-values are stratified by the level of the regions in the fitted tree.
Figure 5:
Proportion of true splits detected (solid lines) and rejected (dotted lines) for CART with selective Z-tests (pink), CTree (black), and CART with sample splitting (blue) across different settings of the data generating mechanism, stratified by level in tree. As CTree only makes a split if the p-value is less than 0.05, the proportion of detections equals the proportion of rejections.
Figure 6:
The median width of the selective Z-intervals for parameter $\nu_{\text{reg}}^\top \mu$ for regions at levels one (solid), two (dashed), and three (dotted) of the tree. Similar results hold for parameter $\nu_{\text{sib}}^\top \mu$. Panel (a) breaks results down by the parameters $a$ and $b$, whereas panel (b) aggregates results across values of parameters $a$ and $b$, and displays them as a function of the adjusted Rand Index between the true and estimated trees.
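
For readers unfamiliar with the adjusted Rand Index used on the horizontal axis of panel (b), the sketch below computes it for two partitions of the same observations. It assumes, for illustration only, that each tree is summarized by the terminal region to which it assigns each observation; the function name `adjusted_rand_index` and the toy labels are hypothetical.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand Index between two partitions given as label sequences."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label sequences must have the same length")
    n = len(labels_a)
    pair_counts = Counter(zip(labels_a, labels_b))  # contingency table cells
    a_counts = Counter(labels_a)                    # row sums
    b_counts = Counter(labels_b)                    # column sums

    sum_cells = sum(comb(c, 2) for c in pair_counts.values())
    sum_a = sum(comb(c, 2) for c in a_counts.values())
    sum_b = sum(comb(c, 2) for c in b_counts.values())

    expected = sum_a * sum_b / comb(n, 2)
    maximum = 0.5 * (sum_a + sum_b)
    if maximum == expected:  # degenerate case, e.g. both partitions are trivial
        return 1.0
    return (sum_cells - expected) / (maximum - expected)


if __name__ == "__main__":
    true_regions = [0, 0, 1, 1]       # hypothetical terminal regions, true tree
    estimated_regions = [0, 0, 1, 2]  # hypothetical terminal regions, fitted tree
    print(adjusted_rand_index(true_regions, estimated_regions))
```

A value of 1 means the two partitions agree exactly up to relabeling, and values near 0 indicate agreement no better than chance.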
Figure 7:
QQ plots of the p-values from testing $H_0: \nu_{\text{sib}}^\top \mu = 0$ when $\mu = 0_n$ using the selective Z-test with three different values plugged in to the truncated normal CDF for $\sigma$. The p-values are stratified by the level of the regions in the fitted tree.
Figure 8:
Proportion of true splits detected (solid lines) and rejected (dotted lines) for CART with the three versions of the selective Z-test. The results are stratified by level in tree.
Figure 9:
Left: A CART tree fit to the Box Lunch Study data. Each split has been labeled with a p-value (8), and each region has been labeled with a confidence interval (23). The shading of the nodes indicates the average response values (white indicates a very small value and dark blue a very large value). Top right: A CTree fit to the Box Lunch Study data. Bottom right: A scatterplot showing the relationship between the covariate hunger and the response.
Figure 10:
(a) An illustration of Case 1 (red), Case 2 (blue), and Case 3 (black) for a region $R \in \text{TREE}^{0}(y'(\phi_1, \nu))$ in the base case of the proof of Lemma 27, where () $= R^{(0)}, \ldots, R^{(3)}$. (b) The black regions show the possible cases for $R \in \text{TREE}^{k-1}$ in the inductive step of the proof of Lemma 27.
Figure 11:
Simulation results comparing inference based on the full conditioning set to inference based on the identity permutation only (see Section 4.3). The left panel shows power curves. The center panel zooms in on one section of the left panel. The right panel shows median widths of confidence intervals.
Figure 12:
Quantile-quantile plots of the p-values for testing $H_0: \nu_{\text{sib}}^\top \mu = 0$ under a global null. A naive Z-test (green), sample splitting (blue), and selective Z-test (pink) were performed; see Section 5.2.
Figure 13:
A CART tree fit to the Box Lunch Study data. Each split has been labeled with a p-value (8), and each region has been labeled with a confidence interval (23). Inference is carried out by plugging in $\hat{\sigma}_{\text{cons}}$, from Section 5.7, as an estimate of $\sigma$.

References

    1. Athey Susan and Imbens Guido. Recursive partitioning for heterogeneous causal effects. Proceedings of the National Academy of Sciences, 113(27):7353–7360, 2016.
    2. Bhattacharya PK. Some aspects of change-point analysis. Lecture Notes-Monograph Series, pages 28–56, 1994.
    3. Bourgon Richard. Overview of the intervals package, 2009. R Vignette, URL https://cran.r-project.org/web/packages/intervals/vignettes/intervals_ov....
    4. Breiman Leo, Friedman Jerome, Stone Charles J, and Olshen Richard A. Classification and regression trees. CRC Press, 1984.
    5. Chen Shuxiao and Bien Jacob. Valid inference corrected for outlier removal. Journal of Computational and Graphical Statistics, 29(2):323–334, 2020.
