Spot the difference: comparing results of analyses from real patient data and synthetic derivatives
- PMID: 33623891
- PMCID: PMC7886551
- DOI: 10.1093/jamiaopen/ooaa060
Abstract
Background: Synthetic data may provide a solution to researchers who wish to generate and share data in support of precision healthcare. Recent advances in data synthesis enable the creation and analysis of synthetic derivatives as if they were the original data; this process has significant advantages over data deidentification.
Objectives: To assess a big-data platform with data-synthesizing capabilities (MDClone Ltd., Beer Sheva, Israel) for its ability to produce data that can be used for research purposes while obviating privacy and confidentiality concerns.
Methods: We explored three use cases and tested the robustness of synthetic data by comparing the results of analyses of synthetic derivatives with the results of the same analyses of the original data, using traditional statistics, machine learning approaches, and spatial representations of the data. We designed the use cases to cover analyses at the level of individual observations (Use Case 1), patient cohorts (Use Case 2), and population-level data (Use Case 3).
Results: For each use case, the results of the analyses were sufficiently statistically similar (P > 0.05) between the synthetic derivative and the real data to draw the same conclusions.
Discussion and conclusion: This article presents the results of each use case and outlines key considerations for the use of synthetic data, examining their role in clinical research for faster insights and improved data sharing in support of precision healthcare.
Keywords: data analysis; electronic health records and systems; precision health care; protected health information; synthetic data.
© The Author(s) 2020. Published by Oxford University Press on behalf of the American Medical Informatics Association.
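The kind of comparison summarized in the Methods and Results (checking whether an analysis gives statistically similar answers on real and synthetic data) can be illustrated with a minimal sketch. This is not the study's method or the MDClone platform's API: the function name `permutation_p_value`, the simulated blood-pressure-like values, and the choice of a permutation test on the difference in means are all assumptions made for illustration.

```python
import random
import statistics


def permutation_p_value(real, synthetic, n_permutations=2000, seed=0):
    """Two-sided permutation test on the difference in means.

    Hypothetical helper for illustration only; the abstract does not
    specify which tests the authors ran.
    """
    rng = random.Random(seed)
    observed = abs(statistics.fmean(real) - statistics.fmean(synthetic))
    pooled = list(real) + list(synthetic)
    n_real = len(real)
    extreme = 0
    for _ in range(n_permutations):
        # Reassign group labels at random and recompute the statistic.
        rng.shuffle(pooled)
        diff = abs(
            statistics.fmean(pooled[:n_real]) - statistics.fmean(pooled[n_real:])
        )
        if diff >= observed:
            extreme += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (extreme + 1) / (n_permutations + 1)


# Simulated stand-ins for a real variable and two candidate derivatives
# (assumed values, not data from the study).
data_rng = random.Random(42)
real = [data_rng.gauss(120, 15) for _ in range(200)]
synthetic = [x + data_rng.gauss(0, 2) for x in real]  # small perturbation
shifted = [x + 10 for x in real]                      # systematically biased

p_similar = permutation_p_value(real, synthetic)
p_shifted = permutation_p_value(real, shifted)

print(f"real vs. synthetic: p = {p_similar:.3f}")
print(f"real vs. shifted:   p = {p_shifted:.3f}")
```

In this sketch a P value above 0.05 means only that the test found no detectable difference in means; it is not by itself proof that the two data sets are equivalent, which is why comparing the conclusions of the downstream analyses, as the abstract describes, is also needed.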