2020 Jul 1;27(9):1411-1419.
doi: 10.1093/jamia/ocaa119.

Generating sequential electronic health records using dual adversarial autoencoder

Dongha Lee et al. J Am Med Inform Assoc.

Abstract

Objective: Recent studies on electronic health records (EHRs) have begun to train deep generative models that synthesize large numbers of realistic records, in order to address the significant privacy issues surrounding EHRs. However, most of them focus only on structured records of patients' independent visits rather than on chronological clinical records. In this article, we aim to learn and synthesize realistic sequences of EHRs using a generative autoencoder.

Materials and methods: We propose a dual adversarial autoencoder (DAAE), which learns set-valued sequences of medical entities by combining a recurrent autoencoder with 2 generative adversarial networks (GANs). DAAE improves the mode coverage and quality of generated sequences by adversarially learning both the continuous latent distribution and the discrete data distribution. Using the MIMIC-III (Medical Information Mart for Intensive Care-III) and UT Physicians clinical databases, we evaluated the performance of DAAE in terms of predictive modeling, plausibility, and privacy preservation.
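To make the phrase "set-valued sequences of medical entities" concrete, the following minimal sketch shows one common way to represent such data: each patient is an ordered list of visits, and each visit is an unordered set of codes turned into a multi-hot vector. The vocabulary, the code names, and the helper functions are hypothetical and chosen only for illustration; the abstract does not specify how DAAE encodes its inputs.

```python
from typing import Dict, List, Set

# Hypothetical toy vocabulary of medical entities (placeholder names,
# not real diagnosis, medication, or procedure codes).
VOCAB: Dict[str, int] = {"dx_A": 0, "dx_B": 1, "rx_C": 2, "proc_D": 3}


def multi_hot(visit: Set[str], vocab: Dict[str, int]) -> List[int]:
    """Encode one visit (an unordered set of codes) as a 0/1 vector."""
    vec = [0] * len(vocab)
    for code in visit:
        vec[vocab[code]] = 1
    return vec


def encode_patient(visits: List[Set[str]], vocab: Dict[str, int]) -> List[List[int]]:
    """Encode a chronological list of visits as a sequence of multi-hot vectors."""
    return [multi_hot(visit, vocab) for visit in visits]


# One patient with three visits, listed in chronological order.
patient = [{"dx_A", "rx_C"}, {"dx_B"}, {"dx_A", "proc_D", "rx_C"}]
encoded = encode_patient(patient, VOCAB)
```

Under these assumptions, the order of the outer list carries the temporal information, while the order of codes inside a visit is irrelevant. A sequence model would consume one vector per visit.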

Results: Our generated sequences of EHRs showed performance comparable to that of real data on a predictive modeling task, and achieved the best score among all models in the plausibility evaluation conducted by medical experts. In addition, differentially private optimization of our model enables the generation of synthetic sequences without increasing the privacy leakage of patients' data.
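The abstract does not say which differentially private optimizer was used. As an illustration only, the sketch below shows the gradient step of a DP-SGD-style procedure, a widely used approach: clip each example's gradient to a fixed L2 norm, sum the clipped gradients, add Gaussian noise scaled to the clipping bound, and average. The function names and the parameters `max_norm` and `noise_multiplier` are assumptions made for this example, not details taken from the paper.

```python
import math
import random
from typing import List


def clip_gradient(grad: List[float], max_norm: float) -> List[float]:
    """Scale a gradient down so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def dp_average_gradient(
    per_example_grads: List[List[float]],
    max_norm: float,
    noise_multiplier: float,
    rng: random.Random,
) -> List[float]:
    """Clip per-example gradients, add Gaussian noise to their sum, then average."""
    clipped = [clip_gradient(g, max_norm) for g in per_example_grads]
    n = len(clipped)
    dim = len(clipped[0])
    summed = [sum(g[i] for g in clipped) for i in range(dim)]
    # Noise standard deviation is tied to the clipping bound (assumed convention).
    sigma = noise_multiplier * max_norm
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [v / n for v in noisy]


# Toy gradients for two examples; the first exceeds the clipping bound.
grads = [[3.0, 4.0], [0.3, 0.4]]
no_noise = dp_average_gradient(grads, max_norm=1.0, noise_multiplier=0.0, rng=random.Random(0))
with_noise = dp_average_gradient(grads, max_norm=1.0, noise_multiplier=1.0, rng=random.Random(0))
```

Clipping bounds the influence any single patient record can have on an update, and the noise masks what remains. A larger `noise_multiplier` gives a stronger privacy guarantee at the cost of noisier updates.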

Conclusions: DAAE can effectively synthesize sequential EHRs by addressing the two main challenges of this task: the synthetic records should be realistic enough that they cannot be distinguished from real records, and they should cover all the training patients in order to reproduce the performance of specific downstream tasks.

Keywords: differential privacy; electronic health records (EHRs); generative adversarial networks (GANs); generative autoencoder; sequential data generation.

Figures

Figure 1.
The dual adversarial autoencoder architecture that is composed of the sequence-to-sequence autoencoder, the inner generative adversarial network, and the outer generative adversarial network.
Figure 2.
The detailed architecture of the sequence-to-sequence autoencoder in dual adversarial autoencoder architecture. The embedding layer encodes the semantics of all the medical entities, and the gated recurrent unit layer learns the temporal contexts within patients’ sequential records.
Figure 3.
Plausibility scores evaluated by medical experts. (a) Dataset: MIMIC-III (b) Dataset: UTP. ARAE: adversarially regularized autoencoder; DAAE: dual adversarial autoencoder; medGAN: medical generative adversarial network; VAE: variational autoencoder; WAE: Wasserstein autoencoder.
Figure 4.
The code distribution (gray points) with identified modes (colored points). (a) VAE, (b) WAE, (c) ARAE, and (d) DAAE. Dataset: UT Physicians. ARAE: adversarially regularized autoencoder; DAAE: dual adversarial autoencoder; medGAN: medical generative adversarial network; VAE: variational autoencoder; WAE: Wasserstein autoencoder.
Figure 5.
Sequence modeling accuracies achieved by synthetic sequences with different privacy cost. (a) Dataset: MIMIC-III, (b) Dataset: UTP.
