Big Data, Data Science, and Causal Inference: A Primer for Clinicians

Yoshihiko Raita¹, Carlos A Camargo Jr¹ ² ³, Liming Liang¹ ³ ⁴, Kohei Hasegawa¹ ³ ⁴

Affiliations

¹ Department of Emergency Medicine, Harvard Medical School, Massachusetts General Hospital, Boston, MA, United States.
² Division of Rheumatology, Allergy, and Immunology, Department of Medicine, Harvard Medical School, Massachusetts General Hospital, Boston, MA, United States.
³ Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA, United States.
⁴ Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, United States.

PMID: 34295910
PMCID: PMC8290071
DOI: 10.3389/fmed.2021.678047

Review

Yoshihiko Raita et al. Front Med (Lausanne). 2021 Jul 6:8:678047. doi: 10.3389/fmed.2021.678047. eCollection 2021.

Abstract

Clinicians handle a growing amount of clinical, biometric, and biomarker data. In this "big data" era, there is an emerging faith that the answers to all clinical and scientific questions reside in "big data" and that data will transform medicine into precision medicine. However, data by themselves are useless. It is the algorithms encoding causal reasoning and domain (e.g., clinical and biological) knowledge that prove transformative. The recent introduction of (health) data science presents an opportunity to rethink this data-centric view. For example, while precision medicine seeks to provide the right prevention and treatment strategy to the right patients at the right time, it cannot be realized by algorithms that operate exclusively in data-driven prediction modes, as do most machine learning algorithms. A better understanding of data science and its tasks is vital to interpret findings and translate new discoveries into clinical practice. In this review, we first discuss the principles and major tasks of data science by organizing it into three defining tasks: (1) association and prediction, (2) intervention, and (3) counterfactual causal inference. Second, we review commonly used data science tools with examples from the medical literature. Lastly, we outline current challenges and future directions in medicine, elaborating on how data science can enhance clinical effectiveness and inform medical practice. As machine learning algorithms become ubiquitous tools for handling quantitatively "big data," their integration with causal reasoning and domain knowledge is instrumental to qualitatively transforming medicine, which will, in turn, improve the health outcomes of patients.

Keywords: big data; causal inference; data science; machine learning; the ladder of causation.

Conflict of interest statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Figures

**Figure 1**
Examples of causal directed acyclic graphs that encode *a priori* domain knowledge and causal structural hypotheses. **(A)** Birth-weight paradox. There is no direct arrow from maternal smoking (exposure) to infant mortality (outcome), representing no causal effect. However, an association/prediction-mode machine learning algorithm would automatically adjust for variables that are associated with both smoking and mortality (e.g., low birth weight). Graphically, a rectangle placed around the low birth weight variable represents adjustment. This adjustment for the collider (a node on which two directed arrows “collide”; Table 1) opens the flow of association from exposure → collider → covariates (e.g., structural anomaly) → outcome, which leads to a spurious (non-causal) association. **(B)** Simple example of a causal diagram, consisting of exposure (biologic agent), outcome (asthma control), and covariates (e.g., baseline severity of illness). The presence of an edge from one variable to another represents our knowledge of the presence of a direct effect. **(C)** Example of confounding. While there is no causal effect (i.e., no direct arrow from exposure to outcome), there is an association between these variables through the paths involving a common-cause covariate (i.e., a confounder), leading to a non-causal association between the exposure and outcome (i.e., confounding; Table 1). **(D)** Example of de-confounding. This confounding can be addressed by adjusting for the confounder, which blocks the back-door path. Graphically, a rectangle placed around the confounder blocks the association flow through the back-door path. **(E)** Example of mediation. The causal relation between the exposure (systemic antibiotic use), mediator (airway microbiome), and outcome (asthma development). The confounders (e.g., acute respiratory infections) between the exposure, mediator, and outcome should be adjusted for. The indirect (or mediation) effect is represented by the path that passes through the mediator. The direct effect is represented by the path that does not pass through the mediator (the broken line; Table 1). **(F)** Example of Mendelian randomization. Genetic variants that are strongly associated with the exposure of interest (mental illnesses) function as the instrumental variable. Note that there is no association (or path) between the genetic variants and unmeasured confounders (i.e., independence condition) and that the genetic variants affect the outcome only through their effect on the exposure (i.e., exclusion restriction condition; Table 3).
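
The contrast between panels (A), (C), and (D) can be illustrated with a small simulation. The sketch below is not taken from the article: it assumes linear relations with standard normal noise, treats every variable as continuous for simplicity, and uses hypothetical helper names (`slope`, `residuals`, `adjusted_slope`). In both scenarios the true causal effect of the exposure on the outcome is zero. Adjusting for a confounder should remove the crude association, whereas adjusting for a collider should create one.

```python
import random

def slope(x, y):
    """Ordinary least-squares slope of y on a single predictor x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def residuals(y, z):
    """Residuals of y after a simple linear regression on z."""
    n = len(y)
    mz, my = sum(z) / n, sum(y) / n
    b = slope(z, y)
    return [yi - my - b * (zi - mz) for yi, zi in zip(y, z)]

def adjusted_slope(x, y, z):
    """Slope of y on x adjusted for z (partialling z out of both)."""
    return slope(residuals(x, z), residuals(y, z))

rng = random.Random(0)  # fixed seed so the sketch is reproducible
n = 20000

# Panels C/D (assumed structure): confounder -> exposure, confounder -> outcome.
confounder = [rng.gauss(0, 1) for _ in range(n)]
exposure_c = [c + rng.gauss(0, 1) for c in confounder]
outcome_c = [c + rng.gauss(0, 1) for c in confounder]
crude_confounded = slope(exposure_c, outcome_c)                 # non-zero
deconfounded = adjusted_slope(exposure_c, outcome_c, confounder)  # near zero

# Panel A (assumed structure): exposure -> collider <- anomaly -> outcome.
exposure_a = [rng.gauss(0, 1) for _ in range(n)]
anomaly = [rng.gauss(0, 1) for _ in range(n)]
collider = [e + a + rng.gauss(0, 1) for e, a in zip(exposure_a, anomaly)]
outcome_a = [a + rng.gauss(0, 1) for a in anomaly]
crude_unadjusted = slope(exposure_a, outcome_a)                      # near zero
collider_adjusted = adjusted_slope(exposure_a, outcome_a, collider)  # non-zero

print(f"confounding: crude={crude_confounded:.2f}, adjusted={deconfounded:.2f}")
print(f"collider:    crude={crude_unadjusted:.2f}, adjusted={collider_adjusted:.2f}")
```

Under these assumed coefficients, the crude slope in the confounding scenario should approach 0.5 and the collider-adjusted slope should approach -0.5 in large samples, while the correctly specified analyses should approach zero. The point is that the choice of adjustment set comes from the causal diagram and not from the data alone.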

**Figure 2**
Identification and estimation of heterogeneous treatment effects. In this *hypothetical* example, suppose we investigate the treatment effects of systemic corticosteroids on hospitalization rates among preschool children with virus-induced wheezing. **(A)** Randomized controlled trial (RCT) to investigate the *average* treatment effect of systemic corticosteroids (conventional 1:1 RCT). **(B)** Investigating heterogeneous treatment effects using tree-based machine learning models. In each of the branches (e.g., subgroup A children have a specific virus infection and a history of atopy), children have a comparable predicted probability of receiving systemic corticosteroids. Children within each subgroup function as if they came from an RCT with eligibility criteria stratified by clinical characteristics.
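
The difference between an average and a subgroup-specific treatment effect can be sketched with simulated data. The example below is a simplification, not the tree-based method in panel (B): the subgroup (history of atopy) is assumed to be known in advance instead of being learned by a tree, treatment is randomized 1:1, and all risks are invented for illustration (baseline hospitalization risk 0.30, lowered by 0.20 only in treated children with atopy).

```python
import random

rng = random.Random(1)  # fixed seed so the sketch is reproducible

def simulate_child():
    """Return (atopy, treated, hospitalized) for one hypothetical child."""
    atopy = rng.random() < 0.5
    treated = rng.random() < 0.5  # 1:1 randomization
    # Hypothetical risks: treatment helps only children with atopy.
    risk = 0.30 - (0.20 if (treated and atopy) else 0.0)
    hospitalized = rng.random() < risk
    return atopy, treated, hospitalized

def risk_difference(rows):
    """Hospitalization risk in treated minus risk in untreated children."""
    treated = [h for _, t, h in rows if t]
    control = [h for _, t, h in rows if not t]
    return sum(treated) / len(treated) - sum(control) / len(control)

children = [simulate_child() for _ in range(40000)]

average_effect = risk_difference(children)
effect_atopy = risk_difference([r for r in children if r[0]])
effect_no_atopy = risk_difference([r for r in children if not r[0]])

print(f"average effect:        {average_effect:+.3f}")
print(f"effect with atopy:     {effect_atopy:+.3f}")
print(f"effect without atopy:  {effect_no_atopy:+.3f}")
```

With these invented risks, the average risk difference should be close to -0.10, which masks a difference near -0.20 in children with atopy and near zero in children without it. A tree-based model would aim to discover such subgroups from the covariates instead of taking them as given.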

**Figure 3**
Integration of “big data,” data science, and domain knowledge toward precision medicine. Development of precision medicine requires an integration of “big data” from expanded data sources and capture with robust data science methodologies and analytics that encode domain causal knowledge and counterfactual causal reasoning.
