J Biomed Inform. 2022 Jan:125:103977.
doi: 10.1016/j.jbi.2021.103977. Epub 2021 Dec 14.

Membership inference attacks against synthetic health data

Ziqi Zhang et al. J Biomed Inform. 2022 Jan.

Abstract

Synthetic data generation has emerged as a promising method to protect patient privacy while sharing individual-level health data. Intuitively, sharing synthetic data should reduce disclosure risks because no explicit linkage is retained between the synthetic records and the real data upon which it is based. However, the risks associated with synthetic data are still evolving, and what seems protected today may not be tomorrow. In this paper, we show that membership inference attacks, whereby an adversary infers if the data from certain target individuals (known to the adversary a priori) were relied upon by the synthetic data generation process, can be substantially enhanced through state-of-the-art machine learning frameworks, which calls into question the protective nature of existing synthetic data generators. Specifically, we formulate the membership inference problem from the perspective of the data holder, who aims to perform a disclosure risk assessment prior to sharing any health data. To support such an assessment, we introduce a framework for effective membership inference against synthetic health data without specific assumptions about the generative model or a well-defined data structure, leveraging the principles of contrastive representation learning. To illustrate the potential for such an attack, we conducted experiments against synthesis approaches using two datasets derived from several health data resources (Vanderbilt University Medical Center, the All of Us Research Program) to determine the upper bound of risk brought by an adversary who invokes an optimal strategy. The results indicate that partially synthetic data are vulnerable to membership inference at a very high rate. By contrast, fully synthetic data are only marginally susceptible and, in most cases, could be deemed sufficiently protected from membership inference.
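The intuition behind membership inference against synthetic data can be illustrated with a much simpler attack than the contrastive representation learning framework described above. The sketch below is not the authors' method: it is a minimal distance-to-closest-record baseline on toy binary records, and every name in it (`hamming`, `min_distance`, `infer_membership`, `perturb`) is hypothetical. It also assumes, purely for illustration, that "partially synthetic" data can be mimicked by flipping a few bits of each real record, so that each synthetic record stays close to the real record it came from.

```python
import random

def hamming(a, b):
    """Number of positions where two equal-length binary records differ."""
    return sum(x != y for x, y in zip(a, b))

def min_distance(target, synthetic):
    """Distance from a target record to its closest synthetic record."""
    return min(hamming(target, s) for s in synthetic)

def infer_membership(targets, synthetic, threshold):
    """Guess 'member' when a target lies within `threshold` of a synthetic record."""
    return [min_distance(t, synthetic) <= threshold for t in targets]

def random_record(rng, n_features=40):
    """Toy stand-in for a patient record: a tuple of random bits."""
    return tuple(rng.randint(0, 1) for _ in range(n_features))

def perturb(record, rng, n_flips):
    """Copy a record and flip `n_flips` distinct bits (toy 'partial synthesis')."""
    rec = list(record)
    for i in rng.sample(range(len(rec)), n_flips):
        rec[i] = 1 - rec[i]
    return tuple(rec)

rng = random.Random(0)
members = [random_record(rng) for _ in range(50)]      # used to build synthetic data
non_members = [random_record(rng) for _ in range(50)]  # never seen by the generator
synthetic = [perturb(r, rng, n_flips=2) for r in members]

guesses_in = infer_membership(members, synthetic, threshold=2)
guesses_out = infer_membership(non_members, synthetic, threshold=2)
true_positive_rate = sum(guesses_in) / len(guesses_in)
false_positive_rate = sum(guesses_out) / len(guesses_out)
print(true_positive_rate, false_positive_rate)
```

In this toy setting every member sits within two bit flips of a synthetic record, so the threshold rule flags all of them, while unrelated random records are very unlikely to fall that close. A fully synthetic generator would not preserve such a one-to-one correspondence, which is consistent with the lower risk the abstract reports for fully synthetic data.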

Keywords: Contrastive representation learning; Electronic health record; Membership inference; Synthetic data.

Conflict of interest statement

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Figures

Figure 1: An illustration of membership inference against a machine learning model (upper), and against synthetic data (lower). The dashed box indicates the resource that can be used for inference. The shaded box represents the machine learning models.
Figure 2: This figure depicts a patient’s health record, in which each box represents an episode, and the horizontal arrows represent the timeline. Suppose both the third and fourth episodes are unique across the dataset. Either can be considered as a signature for the patient with respect to the dataset. However, only the third episode is considered to be strong because the occurrence of this episode depends on the previous two episodes.
Figure 3: A procedural depiction of the membership inference framework. The black arrows indicate the training process while the red arrows indicate inference using the trained models.
Figure 4: A summary of the membership inference risk against synthetic data (CRL-proxy). Each cell corresponds to a subset of all individuals who could be targeted by the adversary.
Figure 5: An illustration of membership inference risk against synthetic data (CRL-proxy), where the auxiliary data are used to train the adversarial model.
Figure 6: Membership inference results for LE (baseline 1).
Figure 7: Membership inference results for GRL (baseline 2).
Figure 8: Membership inference results for CRL-local (ablation).

