Review
Annu Rev Public Health. 2018 Apr 1:39:95-112.
doi: 10.1146/annurev-publhealth-040617-014208. Epub 2017 Dec 20.

Big Data in Public Health: Terminology, Machine Learning, and Privacy

Stephen J Mooney et al. Annu Rev Public Health. 2018.

Abstract

The digital world is generating data at a staggering and still increasing rate. While these "big data" have unlocked novel opportunities to understand public health, they hold still greater potential for research and practice. This review explores several key issues that have arisen around big data. First, we propose a taxonomy of sources of big data to clarify terminology and identify threads common across some subtypes of big data. Next, we consider common public health research and practice uses for big data, including surveillance, hypothesis-generating research, and causal inference, while exploring the role that machine learning may play in each use. We then consider the ethical implications of the big data revolution with particular emphasis on maintaining appropriate care for privacy in a world in which technology is rapidly changing social norms regarding the need for (and even the meaning of) privacy. Finally, we make suggestions regarding structuring teams and training to succeed in working with big data in research and practice.

Keywords: big data; machine learning; privacy; public health; training.

Figures

Figure 1.
A schematic illustration of deductive disclosure: merging two datasets that are each successfully anonymized may result in a dataset in which subjects can be personally identified.
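
The deductive disclosure idea in the Figure 1 caption can be sketched in a few lines of Python. This is a minimal illustration, not the authors' method: the records, the field names (`study_id`, `zip`, `birth_year`, `sex`, `occupation`), and the helper `smallest_group_size` are all hypothetical, and group size (the "k" of k-anonymity) is used here as an assumed stand-in for "successfully anonymized". Each dataset on its own has at least two records per attribute combination, but once the two are merged every combination is unique.

```python
from collections import Counter

def smallest_group_size(rows, keys):
    """Return the size of the smallest group of rows sharing the same values for keys."""
    counts = Counter(tuple(row[k] for k in keys) for row in rows)
    return min(counts.values())

# Hypothetical, made-up records. study_id is an assumed pseudonymous link key.
dataset_a = [
    {"study_id": 1, "zip": "98101", "birth_year": 1975},
    {"study_id": 2, "zip": "98101", "birth_year": 1975},
    {"study_id": 3, "zip": "98105", "birth_year": 1990},
    {"study_id": 4, "zip": "98105", "birth_year": 1990},
]
dataset_b = [
    {"study_id": 1, "sex": "F", "occupation": "nurse"},
    {"study_id": 2, "sex": "M", "occupation": "teacher"},
    {"study_id": 3, "sex": "F", "occupation": "nurse"},
    {"study_id": 4, "sex": "M", "occupation": "teacher"},
]

# Merge the two datasets on the shared link key.
b_by_id = {row["study_id"]: row for row in dataset_b}
merged = [{**row, **b_by_id[row["study_id"]]} for row in dataset_a]

k_a = smallest_group_size(dataset_a, ("zip", "birth_year"))
k_b = smallest_group_size(dataset_b, ("sex", "occupation"))
k_merged = smallest_group_size(merged, ("zip", "birth_year", "sex", "occupation"))

print(k_a, k_b, k_merged)  # prints "2 2 1"
```

A smallest group size of 1 in the merged data means that anyone who already knows a subject's ZIP code, birth year, sex, and occupation could single out that subject's record, even though neither source dataset allowed this on its own.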

