Public Health Rep. 2021 Sep-Oct;136(5):554-561.
doi: 10.1177/00333549211026817. Epub 2021 Jun 17.

Protecting Privacy and Transforming COVID-19 Case Surveillance Datasets for Public Use

Brian Lee et al. Public Health Rep. 2021 Sep-Oct.

Abstract

Objectives: Federal open-data initiatives that promote increased sharing of federally collected data are important for transparency, data quality, trust, and relationships with the public and state, tribal, local, and territorial partners. These initiatives advance understanding of health conditions and diseases by providing data to researchers, scientists, and policymakers for analysis, collaboration, and use outside the Centers for Disease Control and Prevention (CDC), particularly for emerging conditions such as COVID-19, for which data needs are constantly evolving. Since the beginning of the pandemic, CDC has collected person-level, de-identified data from jurisdictions and currently has more than 8 million records. We describe how CDC designed and produces 2 de-identified public datasets from these collected data.

Methods: We included data elements based on usefulness, public request, and privacy implications; we suppressed some field values to reduce the risk of re-identification and exposure of confidential information. We created datasets and verified them for privacy and confidentiality by using data management platform analytic tools and R scripts.

Results: Unrestricted data are available to the public through Data.CDC.gov, and restricted data, with additional fields, are available with a data-use agreement through a private repository on GitHub.com.

Practice implications: Enriched understanding of the available public data, the methods used to create these data, and the algorithms used to protect the privacy of de-identified people allows for improved data use. Automating data-generation procedures improves the volume and timeliness of data sharing.

Keywords: COVID-19; SARS-CoV-2; data paper; data privacy; de-identification; open data.

Conflict of interest statement

Declaration of Conflicting Interests: The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Figures

Figure 1
Creation of COVID-19 case surveillance public datasets from state, tribal, local, and territorial public health jurisdictions, Centers for Disease Control and Prevention, 2020. Abbreviation: DCIPHER, Data Collation and Integration for Public Health Event Response.
Figure 2
The 7-step process of privacy review implemented by the Centers for Disease Control and Prevention in the design of 2 public datasets for COVID-19 case surveillance in 2020.
Figure 3
An example of how k-anonymity field suppression changes the values of the quasi-identifier fields sex, age_group, and race_ethnicity_combined to reduce the risk of re-identification of individuals in 2 public datasets developed by the Centers for Disease Control and Prevention in 2020 for COVID-19 case surveillance. When the frequency count of raw records with shared quasi-identifiers is below the k = 5 privacy threshold, suppressed data are produced with “NA” values for some quasi-identifiers so that the frequency increases to 5. Abbreviation: NA, not applicable.
Figure 4
An example of how l-diversity field suppression changes values of the confidential pos_spec_dt field to reduce the risk of disclosure of personally identifiable information in 2 public datasets developed by the Centers for Disease Control and Prevention in 2020 for COVID-19 case surveillance. When the count of distinct pos_spec_dt values among raw records that share the quasi-identifiers sex, age_group, and race_ethnicity_combined is below the l = 2 privacy threshold, suppressed data are produced with “NA” values for pos_spec_dt, preventing confidential information from being disclosed based on knowing a patient’s quasi-identifier values. Abbreviation: NA, not applicable.
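
The two suppression rules described in the captions for Figures 3 and 4 can be illustrated with a minimal Python sketch. This is not CDC's implementation (the abstract states that the datasets were verified with data management platform tools and R scripts). The function names, the list-of-dictionaries record layout, and the sample values are assumptions made for illustration; only the field names and the thresholds k = 5 and l = 2 come from the captions. As a simplification, the sketch sets all 3 quasi-identifiers to "NA" for any record in a group smaller than k, whereas the captions describe suppressing only some quasi-identifiers until the group frequency reaches 5.

```python
from collections import Counter, defaultdict

# Field names taken from the figure captions.
QUASI_IDENTIFIERS = ("sex", "age_group", "race_ethnicity_combined")
CONFIDENTIAL_FIELD = "pos_spec_dt"
NA = "NA"


def _group_key(record):
    """Return the tuple of quasi-identifier values that defines a record's group."""
    return tuple(record[field] for field in QUASI_IDENTIFIERS)


def suppress_k_anonymity(records, k=5):
    """Set every quasi-identifier to NA for records whose group has fewer than k members.

    Simplification (assumption): all quasi-identifiers are suppressed at once,
    and the resulting NA group is not re-checked against k.
    """
    counts = Counter(_group_key(r) for r in records)
    result = []
    for record in records:
        row = dict(record)  # copy so the input is not modified
        if counts[_group_key(record)] < k:
            for field in QUASI_IDENTIFIERS:
                row[field] = NA
        result.append(row)
    return result


def suppress_l_diversity(records, l=2):
    """Set the confidential field to NA when a group holds fewer than l distinct values."""
    distinct_values = defaultdict(set)
    for record in records:
        distinct_values[_group_key(record)].add(record[CONFIDENTIAL_FIELD])
    result = []
    for record in records:
        row = dict(record)
        if len(distinct_values[_group_key(record)]) < l:
            row[CONFIDENTIAL_FIELD] = NA
        result.append(row)
    return result


if __name__ == "__main__":
    # Hypothetical sample: 5 records share one combination, 1 record is unique.
    common = {"sex": "Female", "age_group": "20 - 29 Years",
              "race_ethnicity_combined": "White, Non-Hispanic"}
    sample = [dict(common, pos_spec_dt=f"2020-04-0{1 + i % 2}") for i in range(5)]
    sample.append({"sex": "Male", "age_group": "80+ Years",
                   "race_ethnicity_combined": "Asian, Non-Hispanic",
                   "pos_spec_dt": "2020-04-03"})
    for row in suppress_l_diversity(suppress_k_anonymity(sample, k=5), l=2):
        print(row)
```

In this sample, the 5 records that share a combination are left unchanged, while the single unique record has its quasi-identifiers suppressed by the k-anonymity step and then its pos_spec_dt suppressed by the l-diversity step, because its group contains only 1 distinct date.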

