J Am Med Inform Assoc. 2022 Jul 12;29(8):1350-1365. doi: 10.1093/jamia/ocac045.

Demonstrating an approach for evaluating synthetic geospatial and temporal epidemiologic data utility: results from analyzing >1.8 million SARS-CoV-2 tests in the United States National COVID Cohort Collaborative (N3C)

Jason A Thomas¹, Randi E Foraker²,³, Noa Zamstein⁴, Jon D Morrow⁴,⁵, Philip R O Payne²,³, Adam B Wilcox²,³; N3C Consortium

Collaborators

N3C Consortium:
Melissa A Haendel, Christopher G Chute, Kenneth R Gersing, Anita Walden, Melissa A Haendel, Tellen D Bennett, Christopher G Chute, David A Eichmann, Justin Guinney, Warren A Kibbe, Hongfang Liu, Philip R O Payne, Emily R Pfaff, Peter N Robinson, Joel H Saltz, Heidi Spratt, Justin Starren, Christine Suver, Adam B Wilcox, Andrew E Williams, Chunlei Wu, Christopher G Chute, Emily R Pfaff, Davera Gabriel, Stephanie S Hong, Kristin Kostka, Harold P Lehmann, Richard A Moffitt, Michele Morris, Matvey B Palchuk, Xiaohan Tanner Zhang, Richard L Zhu, Emily R Pfaff, Benjamin Amor, Mark M Bissell, Marshall Clark, Andrew T Girvin, Stephanie S Hong, Kristin Kostka, Adam M Lee, Robert T Miller, Michele Morris, Matvey B Palchuk, Kellie M Walters, Anita Walden, Yooree Chae, Connor Cook, Alexandra Dest, Racquel R Dietz, Thomas Dillon, Patricia A Francis, Rafael Fuentes, Alexis Graves, Julie A McMurry, Andrew J Neumann, Shawn T O'Neil, Usman Sheikh, Andréa M Volz, Elizabeth Zampino, Christopher P Austin, Kenneth R Gersing, Samuel Bozzette, Mariam Deacy, Nicole Garbarini, Michael G Kurilla, Sam G Michael, Joni L Rutter, Meredith Temple-O'Connor, Benjamin Amor, Mark M Bissell, Katie Rebecca Bradwell, Andrew T Girvin, Amin Manna, Nabeel Qureshi, Mary Morrison Saltz, Christine Suver, Christopher G Chute, Melissa A Haendel, Julie A McMurry, Andréa M Volz, Anita Walden, Carolyn Bramante, Jeremy Richard Harper, Wenndy Hernandez, Farrukh M Koraishy, Federico Mariona, Saidulu Mattapally, Amit Saha, Satyanarayana Vedula, Yujuan Fu, Nisha Mathews, Ofer Mendelevitch

Affiliations

¹ Department of Biomedical Informatics & Medical Education, University of Washington, Seattle, Washington, USA.
² Division of General Medical Sciences, School of Medicine, Washington University in St. Louis, St. Louis, Missouri, USA.
³ School of Medicine, Institute for Informatics, Washington University in St. Louis, St. Louis, Missouri, USA.
⁴ MDClone Ltd., Be'er Sheva, Israel.
⁵ Department of Obstetrics and Gynecology, New York University Grossman School of Medicine, New York, New York, USA.

PMID: 35357487
PMCID: PMC8992357
DOI: 10.1093/jamia/ocac045

Jason A Thomas et al. J Am Med Inform Assoc. 2022.

Abstract

Objective: This study sought to evaluate whether synthetic data derived from a national coronavirus disease 2019 (COVID-19) dataset could be used for geospatial and temporal epidemic analyses.

Materials and methods: Using an original dataset (n = 1 854 968 severe acute respiratory syndrome coronavirus 2 tests) and its synthetic derivative, we compared key indicators of COVID-19 community spread through analysis of aggregate and zip code-level epidemic curves, patient characteristics and outcomes, distribution of tests by zip code, and indicator counts stratified by month and zip code. Similarity between the data was statistically and qualitatively evaluated.

Results: In general, synthetic data closely matched original data for epidemic curves, patient characteristics, and outcomes. Synthetic data suppressed labels of zip codes with few total tests (mean = 2.9 ± 2.4; max = 16 tests; 66% reduction of unique zip codes). Epidemic curves were similar between synthetic and original data in a random sample of the most tested zip codes (top 1%; n = 171), and monthly indicator counts were similar for all unsuppressed zip codes (n = 5819). In small sample sizes, synthetic data utility was notably decreased.

Discussion: Analyses on the population-level and of densely tested zip codes (which contained most of the data) were similar between original and synthetically derived datasets. Analyses of sparsely tested populations were less similar and had more data suppression.

Conclusion: In general, synthetic data were successfully used to analyze geospatial and temporal trends. Analyses using small sample sizes or populations were limited, in part due to purposeful data label suppression, an attribute disclosure countermeasure. Users should consider data fitness for use in these cases.

Keywords: COVID-19; data sharing; data utility; electronic health records; synthetic data.

Figures

**Figure 1.**
Aggregate epidemic curves with counts (vertical bars) and 7-day moving averages (smoothed line) for (A) tests, (B) cases, (C) percent positive, (D) admissions, and (E) deaths during admission. Color encodings include original data (light blue) and synthetic data (light red), with their overlap (purple). As counts get smaller from tests to deaths, the epidemic curves visually appear less similar.
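The smoothed lines and the percent-positive panel described in this caption can be illustrated with a minimal sketch. The caption does not say whether the 7-day window is trailing or centered; the sketch assumes a trailing window and averages over however many days are available at the start of the series. The function names and daily counts are hypothetical and are not taken from the study.

```python
from collections import deque


def moving_average(counts, window=7):
    """Trailing moving average (assumed; the caption does not specify alignment).

    For the first window-1 days, the average is taken over the days seen so far.
    """
    recent = deque(maxlen=window)  # holds at most the last `window` values
    smoothed = []
    for count in counts:
        recent.append(count)
        smoothed.append(sum(recent) / len(recent))
    return smoothed


def percent_positive(cases, tests):
    """Daily percent positive; None on days with no tests."""
    return [100.0 * c / t if t else None for c, t in zip(cases, tests)]


# Made-up daily counts, for illustration only.
daily_tests = [10, 20, 30, 40, 50, 60, 70, 80]
daily_cases = [1, 2, 3, 4, 5, 6, 7, 8]

smoothed_tests = moving_average(daily_tests)
positivity = percent_positive(daily_cases, daily_tests)
```

Applying the same functions to the original and the synthetic series would give the two curves that the figure overlays.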

**Figure 2.**
Distributions of total tests by zip code shown by original data (light blue) and synthetic data (light red), and their overlap (purple). (A) All data binned by 100. (B) Filtered data with a bin size of 10 to only show the distribution of tests by zip code in zip codes with <100 tests. Both y-axes use a log scale. As seen in panel A, the vast majority of tests are conducted in a minority of zip codes. As seen in panels A and B, the distribution of the synthetic data closely matches the original data at >10 tests per zip code.

**Figure 3.**
Zip code-level epidemic curves with counts (vertical bars) and 7-day moving averages (smoothed line). Color encodings include original data (light blue) and synthetic data (light red), with their overlap (purple). Each row (A–E) corresponds to a different randomly sampled zip code visualizing cases (left column) and admissions (right column). Synthetic data are more similar to original data when indicator density is higher. Overall, synthetic data closely match overall trends and closely match start and end dates.

**Figure 4.**
Zip code-level epidemic curves with counts (vertical bars) and 7-day moving averages (smoothed line). Color encodings include original data (light blue) and synthetic data (light red), with their overlap (purple). Each row (A–E) corresponds to a different randomly sampled zip code visualizing cases (left column) and admissions (right column). Synthetic data are more similar to original data when indicator density is higher. Overall, synthetic data closely match overall trends and closely match start and end dates.

**Figure 5.**
Workflow of the synthetic error experiment: synthetic data (left) and original data (right) are merged so that synthetic error can be calculated.
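The merge step in this workflow can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: synthetic error is assumed to be the synthetic count minus the original count for each zip code and month pair, and a pair missing from one dataset (for example, a suppressed zip code) is assumed to count as 0. The zip code labels and counts are hypothetical.

```python
def synthetic_error(original, synthetic):
    """Merge two {(zip_code, month): count} mappings and return per-pair error.

    Assumptions (not stated in the source): error = synthetic - original,
    and a pair absent from one dataset contributes a count of 0.
    """
    keys = set(original) | set(synthetic)  # outer join on (zip_code, month)
    return {k: synthetic.get(k, 0) - original.get(k, 0) for k in sorted(keys)}


# Made-up monthly test counts for two hypothetical zip codes.
original_counts = {
    ("zipA", "2020-04"): 120,
    ("zipA", "2020-05"): 95,
    ("zipB", "2020-04"): 3,
}
synthetic_counts = {
    ("zipA", "2020-04"): 112,
    ("zipA", "2020-05"): 97,
    # zipB is absent here, standing in for a suppressed zip code label.
}

errors = synthetic_error(original_counts, synthetic_counts)
```

Under this sign convention, a negative error means the synthetic data undercount the indicator for that zip code and month.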

**Figure 6.**
Synthetic error distributions per zip code stratified by month for tests (top row), cases (middle row), and admissions (bottom row) shown both at original scale (left column) and zoomed in to the peak of each row’s middle bin (legend showing bin ranges and color encodings seen on the far right of each row). Original data value denotes the monthly count in the original data for the key indicator of interest. Box plots of synthetic error are shown in the top 30% of each sub-plot (A–F), with a histogram of synthetic error shown in the bottom 70%. Within each sub-plot, the box plot and histogram have a shared x-axis corresponding to synthetic error and shared bins corresponding to the original data value. The y-axis shows the number of zip codes stratified by month (eg, zip code month pairs). Boxes in the box plots span from Q1 to Q3, with median marked inside the box. Fences span ±1.5 times the IQR. Error increased as the size (count) of the original data increased, which allows users to estimate the level of error in their data of interest. The synthetic data systematically underestimate the monthly count of key indicators in zip codes with the most tests, cases, and deaths, and overestimate them in zip codes with the least.
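The box plot elements named in this caption (Q1, median, Q3, and fences at 1.5 times the IQR) can be computed with a short sketch. It assumes the conventional Tukey definition, with fences at Q1 − 1.5 × IQR and Q3 + 1.5 × IQR, and uses the "inclusive" quartile method from Python's `statistics` module; the study's plotting software may compute quartiles differently. The error values are made up.

```python
import statistics


def box_plot_summary(values):
    """Quartiles, IQR, Tukey fences, and the points that fall outside the fences."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        "lower_fence": lower_fence,
        "upper_fence": upper_fence,
        "outliers": outliers,
    }


# Made-up synthetic error values for one bin of zip code and month pairs.
example_errors = [-6, -3, -2, -1, 0, 1, 2, 3, 25]
summary = box_plot_summary(example_errors)
```

Computing one such summary per bin of original data values would reproduce the kind of comparison the box plots make across bins.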

Update of

Demonstrating an approach for evaluating synthetic geospatial and temporal epidemiologic data utility: Results from analyzing >1.8 million SARS-CoV-2 tests in the United States National COVID Cohort Collaborative (N3C).
Thomas JA, Foraker RE, Zamstein N, Payne PRO, Wilcox AB; N3C Consortium. medRxiv [Preprint]. 2021 Jul 8:2021.07.06.21259051. doi: 10.1101/2021.07.06.21259051. Update in: J Am Med Inform Assoc. 2022 Jul 12;29(8):1350-1365. doi: 10.1093/jamia/ocac045. PMID: 34268525.
