Causal inference and the data-fusion problem

Elias Bareinboim¹, Judea Pearl²

Affiliations

¹ Department of Computer Science, University of California, Los Angeles, CA 90095; Department of Computer Science, Purdue University, West Lafayette, IN 47907. Email: eb@purdue.edu
² Department of Computer Science, University of California, Los Angeles, CA 90095.

PMID: 27382148
PMCID: PMC4941504
DOI: 10.1073/pnas.1510507113

Proc Natl Acad Sci U S A. 2016 Jul 5;113(27):7345-52.

Abstract

We review concepts, principles, and tools that unify current approaches to causal analysis and attend to new challenges presented by big data. In particular, we address the problem of data fusion: piecing together multiple datasets collected under heterogeneous conditions (i.e., different populations, regimes, and sampling methods) to obtain valid answers to queries of interest. The availability of multiple heterogeneous datasets presents new opportunities to big data analysts, because combined data can yield knowledge that no individual source could provide alone. However, the biases that emerge in heterogeneous environments require new analytical tools. Some of these biases, including confounding, sampling selection, and cross-population biases, have been addressed in isolation, largely in restricted parametric models. Here we present a general, nonparametric framework for handling these biases and, ultimately, a theoretical solution to the problem of data fusion in causal inference tasks.

Keywords: causal inference; counterfactuals; external validity; selection bias; transportability.

Conflict of interest statement

The authors declare no conflict of interest.

Figures

**Fig. 1.**
Prototypical generalization tasks where the goal is, for example, to estimate a causal effect in a target population (*Top*). Let $V = \{X, Y, Z, W\}$. There are different designs (*Bottom*) showing that data come from nonidealized conditions, specifically: (1) from the same population under an observational regime, $P(v)$; (2) from the same population under an experimental regime when Z is randomized, $P(v \mid do(z))$; (3) from the same population under sampling selection bias, $P(v \mid S = 1)$ or $P(v \mid do(x), S = 1)$; and (4) from a different population that is submitted to an experimental regime when X is randomized, $P(v \mid do(x), S = s)$, and observational studies in the target population.

**Fig. 2.**
(A) Graphical model illustrating d-separation and the backdoor criterion. U terms are not shown explicitly. (B) Illustration of the intervention $do(X = x)$ with arrows toward X cut. (C) Illustration of the spurious paths, which pop out when we cut the outgoing edges from X and need to be blocked if one wants to use adjustment.
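The adjustment mentioned in the Fig. 2 caption can be illustrated with a minimal sketch. It assumes a hypothetical three-variable model (binary Z → X, Z → Y, X → Y) in which $\{Z\}$ satisfies the backdoor criterion; the graph, the probability tables, and all function names below are illustrative assumptions and are not taken from the paper's figure. Under that assumption, the interventional quantity is given by the adjustment formula $P(y \mid do(x)) = \sum_z P(y \mid x, z)\,P(z)$.

```python
from itertools import product

# Hypothetical toy model (NOT the graph in Fig. 2): Z -> X, Z -> Y, X -> Y,
# all variables binary. The numbers are made up for illustration.
P_Z = {0: 0.6, 1: 0.4}                      # P(Z = z)
P_X1_GIVEN_Z = {0: 0.2, 1: 0.7}             # P(X = 1 | Z = z)
P_Y1_GIVEN_XZ = {(0, 0): 0.1, (0, 1): 0.5,  # P(Y = 1 | X = x, Z = z)
                 (1, 0): 0.3, (1, 1): 0.8}


def joint(x, y, z):
    """Observational joint P(x, y, z) implied by the toy model."""
    px = P_X1_GIVEN_Z[z] if x == 1 else 1 - P_X1_GIVEN_Z[z]
    py = P_Y1_GIVEN_XZ[(x, z)] if y == 1 else 1 - P_Y1_GIVEN_XZ[(x, z)]
    return P_Z[z] * px * py


def prob(**fixed):
    """Marginal probability of the partial assignment in `fixed`."""
    total = 0.0
    for x, y, z in product((0, 1), repeat=3):
        point = {"x": x, "y": y, "z": z}
        if all(point[name] == value for name, value in fixed.items()):
            total += joint(x, y, z)
    return total


def backdoor_adjust(x, y=1):
    """sum_z P(y | x, z) P(z), built only from observational quantities."""
    return sum(
        prob(x=x, y=y, z=z) / prob(x=x, z=z) * prob(z=z)
        for z in (0, 1)
    )


def naive_conditional(x, y=1):
    """P(y | x) without adjustment; confounded by Z in this toy model."""
    return prob(x=x, y=y) / prob(x=x)
```

With these made-up numbers, `backdoor_adjust(1)` evaluates to 0.50 while `naive_conditional(1)` gives 0.65; the gap is the confounding bias that blocking the backdoor path through Z removes.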

**Fig. 3.**
Graphical models illustrating identification of $Q = P(y \mid do(x))$ through the use of experiments over an auxiliary variable Z. Identifiability follows from $P(x, y \mid do(Z = z))$ in A, and it also requires $P(v)$ in B. Identifiability in models A and B follows from the identifiability of Q in $G_{\overline{Z}}$.

**Fig. 4.**
Canonical models where selection is treatment dependent in A and B and also outcome dependent in A. More complex models in which $\{W_1, W_2\}$ and $\{Z\}$ are sufficient for adjustment, but only the latter is adequate for recovering from selection bias, are shown in C. There is no sufficient set for adjustment without external data in *D–F*. (D) Example of S-backdoor admissible set. (E and F) Structures with no S-admissible sets that require more involved recoverability strategies involving posttreatment variables.

**Fig. 5.**
Selection diagrams depicting differences between source and target populations. In A, the two populations differ in age (Z) distributions (so S points to Z). In B, the populations differ in how reading skills (Z) depend on age (an unmeasured variable, represented by the open circle) and the age distributions are the same. In C, the populations differ in how Z depends on X. In D, the unmeasured confounder (bidirected arrow) between Z and Y precludes transportability.
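For the simplest case in Fig. 5 (panel A, where the populations differ only in the distribution of age Z), a minimal sketch of the reweighting idea follows. It assumes the transport formula $P^*(y \mid do(x)) = \sum_z P(y \mid do(x), z)\,P^*(z)$, where $P$ refers to the source experiment and $P^*$ to the target population; the age strata, the numbers, and the function name are hypothetical and serve only to illustrate the arithmetic.

```python
def transport_effect(effect_by_z, p_z_target):
    """Reweight z-specific experimental effects from the source population
    by the target population's distribution of Z.

    effect_by_z: mapping z -> P(Y = 1 | do(x), z), estimated in the source.
    p_z_target:  mapping z -> P*(z), estimated observationally in the target.
    """
    return sum(effect_by_z[z] * p_z_target[z] for z in p_z_target)


# Hypothetical inputs, made up for illustration only.
SOURCE_EFFECT_BY_AGE = {"young": 0.30, "old": 0.60}  # P(Y=1 | do(x), z)
P_AGE_SOURCE = {"young": 0.8, "old": 0.2}            # P(z) in the source
P_AGE_TARGET = {"young": 0.3, "old": 0.7}            # P*(z) in the target

effect_in_source = transport_effect(SOURCE_EFFECT_BY_AGE, P_AGE_SOURCE)
effect_in_target = transport_effect(SOURCE_EFFECT_BY_AGE, P_AGE_TARGET)
```

Here `effect_in_source` is 0.36 and `effect_in_target` is 0.51: the same age-specific effects yield a different population-level effect once the age distribution changes. This sketch does not cover panels B to D, where the differences between populations sit elsewhere in the diagram.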
