J Am Med Inform Assoc. 2019 Mar 1;26(3):228-241.
doi: 10.1093/jamia/ocy142.

Synthesizing electronic health records using improved generative adversarial networks

Mrinal Kanti Baowaly et al. J Am Med Inform Assoc.

Abstract

Objective: The aim of this study was to generate synthetic electronic health records (EHRs) that are more realistic than those generated using the existing medical generative adversarial network (medGAN) method.

Materials and methods: We modified medGAN to obtain 2 synthetic data generation models, designated as medical Wasserstein GAN with gradient penalty (medWGAN) and medical boundary-seeking GAN (medBGAN), and compared the results obtained using the 3 models. We used 2 databases: MIMIC-III and the National Health Insurance Research Database (NHIRD), Taiwan. First, we trained the 3 models and used them to generate synthetic EHRs. We then analyzed and compared the models' performance using several statistical methods (Kolmogorov-Smirnov test, dimension-wise probability for binary data, and dimension-wise average count for count data) and 2 machine learning tasks (association rule mining and prediction).
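The statistical comparisons named above can be illustrated with a minimal sketch. This is not the authors' code: the toy matrices, the function names `dimension_wise_mean` and `ks_statistic`, and the choice to run the Kolmogorov-Smirnov comparison on the per-dimension values are all assumptions made for illustration, since the abstract does not say how the test was applied. Rows stand for patients and columns for diagnosis codes.

```python
from bisect import bisect_right


def dimension_wise_mean(rows):
    """Column-wise mean of a patient-by-code matrix.

    For 0/1 data this is the dimension-wise probability of each code;
    for count data it is the dimension-wise average count.
    """
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic.

    Returns the largest absolute gap between the empirical CDFs of a and b.
    """
    sa, sb = sorted(a), sorted(b)
    points = set(sa) | set(sb)
    return max(
        abs(bisect_right(sa, v) / len(sa) - bisect_right(sb, v) / len(sb))
        for v in points
    )


# Hypothetical toy data: 4 patients, 3 binary diagnosis codes.
real = [[1, 0, 1], [1, 1, 0], [0, 0, 1], [1, 0, 0]]
synthetic = [[1, 0, 1], [1, 0, 0], [0, 1, 1], [1, 0, 1]]

real_p = dimension_wise_mean(real)
syn_p = dimension_wise_mean(synthetic)

# Assumed usage: compare the two sets of per-code probabilities.
ks = ks_statistic(real_p, syn_p)

print("real:     ", real_p)
print("synthetic:", syn_p)
print("KS statistic:", ks)
```

Plotting `real_p` against `syn_p` gives the kind of scatterplot described for the dimension-wise probability results: points near the diagonal indicate that the synthetic data reproduce the per-code frequencies of the real data.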

Results: We conducted a comprehensive analysis and found our models were adequately efficient for generating synthetic EHR data. The proposed models outperformed medGAN in all cases, and among the 3 models, boundary-seeking GAN (medBGAN) performed the best.

Discussion: The proposed models generate realistic synthetic EHR data that can be used in the medical industry and related research to support better services. Moreover, they can help remove barriers such as limited access to EHR data and thus accelerate research on medical informatics.

Conclusion: The proposed models can adequately learn the data distribution of real EHRs and efficiently generate realistic synthetic EHRs. The results show the superiority of our models over the existing model.

Figures

Figure 1. ECDFs of ICD codes and patients for MIMIC-III, extended MIMIC-III, and NHIRD datasets.
Figure 2. Original GAN and medGAN architecture.
Figure 3. Scatterplots of dimension-wise probability results of real binary data (x-axis) vs. synthetic counterpart (y-axis) produced by the 3 generative models.
Figure 4. Scatterplots of dimension-wise average count results on real count data (x-axis) vs. synthetic counterpart (y-axis) produced by the 3 generative models.
Figure 5. Scatterplots of dimension-wise prediction results (F1-scores) of logistic regression model trained on real binary data (x-axis) vs. synthetic counterpart (y-axis) produced by the 3 generative models.
Figure 6. Scatterplots of dimension-wise prediction results (F1-scores) of logistic regression model trained on real count data (x-axis) vs. synthetic counterpart (y-axis) produced by the 3 generative models.
