Stat Comput. 2022;32(3):39. doi: 10.1007/s11222-022-10097-z. Epub 2022 May 13.

Distributional anchor regression

Lucas Kook et al. Stat Comput. 2022.

Abstract

Prediction models often fail if train and test data do not stem from the same distribution. Out-of-distribution (OOD) generalization to unseen, perturbed test data is a desirable but difficult-to-achieve property for prediction models and in general requires strong assumptions on the data generating process (DGP). In a causally inspired perspective on OOD generalization, the test data arise from a specific class of interventions on exogenous random variables of the DGP, called anchors. Anchor regression models, introduced by Rothenhäusler et al. (J R Stat Soc Ser B 83(2):215-246, 2021. 10.1111/rssb.12398), protect against distributional shifts in the test data by employing causal regularization. However, so far anchor regression has only been used with a squared-error loss which is inapplicable to common responses such as censored continuous or ordinal data. Here, we propose a distributional version of anchor regression which generalizes the method to potentially censored responses with at least an ordered sample space. To this end, we combine a flexible class of parametric transformation models for distributional regression with an appropriate causal regularizer under a more general notion of residuals. In an exemplary application and several simulation scenarios we demonstrate the extent to which OOD generalization is possible.
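The causal regularization mentioned in the abstract can be made concrete for the original squared-error case. In Rothenhäusler et al.'s anchor regression, the objective ||(I − P_A)(y − Xb)||² + γ||P_A(y − Xb)||², with P_A the projection onto the column span of the anchors A, is equivalent to ordinary least squares on data premultiplied by W = I + (√γ − 1)P_A; γ = 1 recovers OLS, and larger γ protects against anchor-driven shifts. A minimal sketch (function and variable names are our own, for illustration only):

```python
import numpy as np

def anchor_regression(X, y, A, gamma=1.0):
    """L2 anchor regression sketch (after Rothenhäusler et al., 2021).

    Minimizes ||(I - P_A)(y - Xb)||^2 + gamma * ||P_A (y - Xb)||^2,
    which equals OLS on data premultiplied by
    W = I + (sqrt(gamma) - 1) * P_A. gamma = 1 recovers plain OLS.
    """
    n = len(y)
    P = A @ np.linalg.pinv(A)                     # projection onto span(A)
    W = np.eye(n) + (np.sqrt(gamma) - 1.0) * P    # causal-regularization transform
    beta, *_ = np.linalg.lstsq(W @ X, W @ y, rcond=None)
    return beta

# toy data: the anchor A shifts the covariate, which drives the response
rng = np.random.default_rng(0)
A = rng.normal(size=(200, 1))
X = 1.5 * A + rng.normal(size=(200, 1))
y = 2.0 * X[:, 0] + rng.normal(size=200)
b_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
b_anchor = anchor_regression(X, y, A, gamma=1.0)  # coincides with OLS
```

Increasing γ interpolates towards the instrumental-variable solution; the distributional version proposed in this paper keeps the same regularization scheme but replaces squared-error residuals with a more general notion of residuals, so that censored and ordinal responses can be handled.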

Keywords: Anchor regression; Covariate shift; Diluted causality; Distributional regression; Out-of-distribution generalization; Transformation models.

Figures

Fig. 1
Graphical models for the response variable Y, covariates X and hidden confounders H: IV regression with instruments A (left) and anchor regression with anchor A (right). In anchor regression, A is only required to be a source node but is allowed to directly influence response, covariates and hidden confounders
Fig. 2
Illustration of an unconditional transformation model (1 − exp(−exp(·)), b_Bs,6, ϑ) for the Old Faithful Geyser data (Azzalini and Bowman 1990), using a Bernstein polynomial basis expansion of order 6 for the transformation function, h(y) = b_Bs,6(y)^⊤ ϑ. The colored regions indicate the transport of probability mass from P_Y (lower right) to P_Z (upper left) via the transformation function h(y) (upper right). If h is continuously differentiable, the density of Y is given by f_Y(y) = f_Z(h(y)) h′(y)
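The pieces of this construction are easy to sketch in code. Below is a minimal illustration with our own function names, assuming a minimum extreme value distribution F_Z(z) = 1 − exp(−exp(z)) (as in the caption) and a Bernstein basis on a bounded interval; monotonicity of h requires nondecreasing coefficients ϑ:

```python
import numpy as np
from math import comb

def bernstein_basis(y, order, low, high):
    """Bernstein polynomial basis of the given order on [low, high]."""
    t = (np.asarray(y, dtype=float) - low) / (high - low)
    return np.stack([comb(order, k) * t**k * (1 - t)**(order - k)
                     for k in range(order + 1)], axis=-1)

def bernstein_deriv(y, order, low, high, theta):
    """h'(y) for h(y) = basis(y) @ theta, via the standard Bernstein
    derivative identity (a basis of degree order - 1)."""
    diffs = order * np.diff(theta) / (high - low)
    return bernstein_basis(y, order - 1, low, high) @ diffs

def density(y, theta, low, high, order=6):
    """Change of variables f_Y(y) = f_Z(h(y)) h'(y) with
    F_Z(z) = 1 - exp(-exp(z)), so f_Z(z) = exp(z - exp(z))."""
    h = bernstein_basis(y, order, low, high) @ theta
    return np.exp(h - np.exp(h)) * bernstein_deriv(y, order, low, high, theta)
```

For the geyser data one would choose [low, high] to cover the observed waiting times and estimate ϑ by maximum likelihood; here the functions only illustrate the change-of-variables formula from the caption.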
Fig. 3
Structural equation model for a transformation model. Instead of setting up the SEM on the scale of Y, it is defined on the scale of the inverse transformation function h⁻¹. The conditional distribution of Y := h⁻¹(Z | X, H, A) is still fully determined by h and F_Z. The circle around Y emphasizes that its distribution is a deterministic function of its parents
Fig. 4
Leave-one-environment-out cross-validation under increasing causal regularization for the BostonHousing2 data, with town as anchor. Linear (Lm), continuous probit (c-probit), and continuous logit (c-logit, using the exact and the censored response) models are fitted on 91 towns and used to predict the left-out town. a Mean out-of-sample NLL for the left-out census tracts. Beacon Hill, Back Bay and North End are consistently hardest to predict. Consequently, for these towns the cross-validated NLL improves with increasing causal regularization up to a certain point, while for the majority of the remaining towns prediction performance decreases. We thus indeed improve worst-case prediction, in analogy to Eq. (2). Note that log₁₀ ξ = −∞ corresponds to the unpenalized model. b Scaled regression coefficients, interpretable as differences in means (Lm), differences in transformed means (c-probit), and log odds-ratios (c-logit) per standard-deviation increase in a covariate. Only the c-logit (censored) model accounts for right-censored observations. With increasing causal regularization the estimates shrink towards zero, indicating that town may be a weak instrument (see Appendix E)
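The cross-validation scheme in this caption is generic: fit on all environments but one, score on the held-out one, repeat for every environment. A minimal sketch (function names are our own; the constant-Gaussian toy model below merely stands in for the actual Lm/c-probit/c-logit fits):

```python
import numpy as np

def loeo_scores(fit, score, X, y, env):
    """Leave-one-environment-out CV: for each environment, fit on the
    remaining environments and return the mean score on the held-out one."""
    out = {}
    for e in np.unique(env):
        held = (env == e)
        model = fit(X[~held], y[~held])
        out[e] = float(np.mean(score(model, X[held], y[held])))
    return out

# toy illustration: a constant Gaussian model and its per-observation NLL
def fit_const(X, y):
    return y.mean(), y.std() + 1e-9

def gauss_nll(model, X, y):
    mu, sd = model
    return 0.5 * np.log(2 * np.pi * sd**2) + (y - mu)**2 / (2 * sd**2)

env = np.repeat(np.arange(3), 10)
y = np.arange(30, dtype=float)
X = y[:, None]
res = loeo_scores(fit_const, gauss_nll, X, y, env)  # one mean NLL per environment
```

Panel a of the figure reports exactly this kind of per-environment mean NLL, with towns as environments and the causal regularization strength ξ varied across fits.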
Fig. 5
Density estimates for the three census tracts (Loc 1, Loc 2, Loc 3) in Boston's Beacon Hill, the hardest-to-predict town in terms of LOEO cross-validated NLL for ξ=10 (cf. Fig. 4). The dashed gray line indicates the observed outcomes for all three locations, which were all censored at $50,000. Lm assumes equal variances and conditional normality, whereas c-probit relaxes this assumption, leading to more accurate, skewed distributions. Only c-logit (censored) takes the right censoring in the data into account and places a smaller probability density on $50,000 than the c-logit (exact) model, which ignores the censoring
Fig. 6
Test performance (thin lines) over 100 simulations for scenario la with ntrain=300 and ntest=2000. Median test performance over all simulations is indicated by the thick line. The α-quantiles of the test absolute prediction error APE := |y − ŷ|, where ŷ denotes the conditional median, are shown for linear L2 anchor regression (a) using γ=13, and the negative log-likelihood contributions for distributional (conditionally Gaussian) linear anchor regression (b) with ξ=(γ−1)/2=6. The two models are equivalent up to estimating the residual variance via maximum likelihood in the distributional anchor TM. The change in perspective from an L2 to a distributional loss requires different evaluation metrics, of which the log-likelihood, being a proper scoring rule, is the most natural choice
Fig. 7
Test performance over 100 simulations for scenario nla with ntrain=300 and ntest=2000. Mean (a) and α-quantiles of the negative log-likelihood contributions (b) for the c-probit anchor TM. The test data are generated under strong push-interventions on the distribution of the anchors (cf. Table 2). The strength of causal regularization was chosen as ξ=6
Fig. 8
Test performance over 100 simulations for scenario iv1 with ntrain=1000 and ntest=2000. Quantiles of the individual negative log-likelihood contributions (a) and estimates of β (b) for increasingly strong causal regularization. The ground truth is indicated by a dashed line. The test data are generated under the intervention do(A=3.6)
Fig. 9
Test performance and coefficient estimates over 200 simulations for scenario iv2. Because the results are comparable across sample sizes and numbers of classes, only the results for ntrain=1000 and K=10 are displayed. a: Test log-likelihood contributions for instruments of varying strength (columns) and perturbation sizes (rows). b: Parameter estimates β̂ pooled over all intervention scenarios, since the interventions do not influence estimation. The simulated ground truth β=0.5 is indicated with a dashed line
Fig. 10
Test performance (quantile functions) over 100 simulations for scenario nla with the same parameters as in Fig. 7, but using a mis-specified linear L2 anchor (a) and normal linear anchor transformation (b) model. The quantile function of the absolute prediction errors (a) and negative log-likelihood (b) contributions are shown with γ=13 and ξ=6, respectively. The point-wise median over all simulations is indicated by a thick line. As a reference, we show the point-wise median NLL of the c-probit model from Fig. 7 in b
Fig. 11
Test performance (quantile functions) over 100 simulations for scenario nla with the same parameters as in Fig. 7, using conditional mean predictions from L2 anchor boosting (γ=13) and conditional median predictions from a c-probit model (ξ=6). The point-wise median over all simulations is indicated by a thick line
Fig. 12
Test performance (quantile functions) over 100 simulations for scenario iv1 with the same parameters as in Fig. 8, but using a mis-specified linear L2 anchor (a) and normal linear anchor transformation (b) model. The quantile function of the absolute prediction errors (a) and negative log-likelihood (b) contributions are shown with γ=15 and ξ=7, respectively. The point-wise median over all simulations is indicated by a thick line. The original point-wise median NLL of the correctly specified c-probit model from Fig. 8 is depicted in black
Fig. 13
Simulation results for scenario iv2 for K ∈ {4, 6}, repeated 100 times. a The average test NLL is displayed for each simulation, varying M_X with constant do(A=3). b Coefficient estimates for each model, where the simulated ground truth (β=0.5) is indicated by a dashed line. For details, see Fig. 9

References

    1. Aalen O, Borgan O, Gjessing H. Survival and Event History Analysis: A Process Point of View. Berlin: Springer; 2008.
    2. Abadi M, et al. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. https://tensorflow.org/ (2015).
    3. Angrist JD, Imbens GW, Rubin DB. Identification of causal effects using instrumental variables. J Am Stat Assoc. 1996;91(434):444–455. doi: 10.1080/01621459.1996.10476902.
    4. Arjovsky M, Bottou L, Gulrajani I, Lopez-Paz D. Invariant risk minimization. arXiv preprint arXiv:1907.02893 (2019).
    5. Azzalini A, Bowman AW. A look at some data on the Old Faithful geyser. Appl Stat. 1990;39(3):357. doi: 10.2307/2347385.
