J Am Med Inform Assoc. 2022 Dec 13;30(1):16-25.
doi: 10.1093/jamia/ocac184.

Generating synthetic mixed discrete-continuous health records with mixed sum-product networks

Shannon K S Kroes et al. J Am Med Inform Assoc.

Abstract

Objective: Privacy is a concern whenever individual patient health data is exchanged for scientific research. We propose using mixed sum-product networks (MSPNs) as private representations of data and take samples from the network to generate synthetic data that can be shared for subsequent statistical analysis. This anonymization method was evaluated with respect to privacy and information loss.

Materials and methods: Using a simulation study, information loss was quantified by assessing whether synthetic data could reproduce regression parameters obtained from the original data. Predictor variable types were varied between continuous, count, categorical, and mixed discrete-continuous. Additionally, we measured whether the MSPN approach successfully anonymizes the data by removing associations between background and sensitive information for these datasets.

Results: The synthetic data generated with MSPNs yielded regression results highly similar to those generated with original data, differing by less than 5% in most simulation scenarios. Standard errors increased compared to the original data. Particularly for smaller datasets (1000 records), this resulted in a discrepancy between the estimated and empirical standard errors. Sensitive values could no longer be inferred from background information for at least 99% of tested individuals.

Discussion: The proposed anonymization approach yields very promising results. Further research is required to evaluate its performance with other types of data and analyses, and to predict how user parameter choices affect a bias-privacy trade-off.

Conclusion: Generating synthetic data from MSPNs is a promising, easy-to-use approach for anonymization of sensitive individual health data that yields informative and private data.

Keywords: anonymization; health data; mixed sum-product networks; privacy; synthetic data.
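
The information-loss check described under Materials and methods can be illustrated with a small sketch: fit the same regression to the original and to the synthetic data, then compare the estimates. The code below is only one plausible way to do this and is not taken from the authors' repository. The helper names (`ols_coefficients`, `relative_difference`) are hypothetical, and the synthesizer is a deliberately simple stand-in (a fitted multivariate normal), not an MSPN.

```python
import numpy as np

def ols_coefficients(X, y):
    """Least-squares coefficients for y ~ intercept + X (intercept first)."""
    design = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta

def relative_difference(beta_original, beta_synthetic):
    """Absolute difference relative to the original estimate (needs nonzero betas)."""
    return np.abs(beta_synthetic - beta_original) / np.abs(beta_original)

rng = np.random.default_rng(0)
n = 10_000

# Simulated "original" data with known, nonzero regression parameters.
X = rng.normal(size=(n, 2))
y = 1.0 + 0.5 * X[:, 0] - 0.3 * X[:, 1] + rng.normal(size=n)

# Stand-in synthesizer (assumption, NOT an MSPN): sample new records from a
# multivariate normal fitted to the joint distribution of (X, y).
data = np.column_stack([X, y])
synthetic = rng.multivariate_normal(
    data.mean(axis=0), np.cov(data, rowvar=False), size=n
)

beta_orig = ols_coefficients(X, y)
beta_syn = ols_coefficients(synthetic[:, :2], synthetic[:, 2])
rel = relative_difference(beta_orig, beta_syn)

print("original :", beta_orig)
print("synthetic:", beta_syn)
print("relative difference:", rel)
```

A relative difference is only defined when the original estimate is nonzero. For scenarios with a true parameter of zero, an absolute difference would be the more natural comparison.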

Figures

Figure 1.
(A) Scatterplot of variables X1 and X2 of a bivariate normal distribution with correlation of 0.3 (n = 10 000), where the color of each data point corresponds to the cluster to which it has been assigned. (B) The corresponding MSPN, containing one sum node and 6 product nodes, with the histograms representing the probability densities of the (independent) univariate distributions.
Figure 2.
Illustration of the simulation study that evaluates the performance of the proposed anonymization method with MSPNs. X = original dataset; X’ = corresponding synthetic dataset; n = number of records in X (either 1000, 10 000, or 100 000); β = true value of regression parameter of interest (either zero or nonzero); multivariate distribution of predictors is varied between continuous, count-valued, discrete, or mixed; β̂ = parameter of interest as estimated with X; β̂’ = parameter of interest as estimated with X’; Hj = histogram of the jth variable of X within cluster; γ = number of clusters; PoAC = Proportion of Alternatives Considered (privacy measure for unordered variables); ED = expected deviation (privacy measure for ordered variables). An arrow with a function name (capital first letter) points from its input to its output. Note: The function names in this figure correspond to the code in the GitHub repository (https://github.com/ShannonKroes/MSPN_privacy).
Figure 3.
Simulation results of regression analyses with a continuous outcome. Distribution of estimated regression parameters is depicted for analyses with the original and anonymized data, for 4 predictor distributions (rows) and 3 sample sizes (columns) for 2 values of the parameter.
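
The structure described in the Figure 1 caption (one sum node over several product nodes, each with independent univariate histograms) can be sketched as a mixture of per-cluster histogram products. This is a simplified illustration under stated assumptions, not the authors' implementation: real MSPN learning chooses splits with independence tests and recursion, whereas this sketch uses a small hand-written k-means for the clustering step. All function names here (`kmeans_labels`, `fit_mixture_of_histograms`, `sample`) are hypothetical.

```python
import numpy as np

def kmeans_labels(data, k, rng, iters=20):
    """Tiny Lloyd-style k-means; returns a cluster label per row."""
    centers = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(iters):
        dists = ((data[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for c in range(k):
            members = data[labels == c]
            if len(members) > 0:  # keep the old center if a cluster is empty
                centers[c] = members.mean(axis=0)
    return labels

def fit_mixture_of_histograms(data, labels, bins=20):
    """Sum node = cluster weights; product node = independent histograms."""
    model = []
    for c in np.unique(labels):
        members = data[labels == c]
        weight = len(members) / len(data)
        leaves = [np.histogram(members[:, j], bins=bins)
                  for j in range(data.shape[1])]
        model.append((weight, leaves))
    return model

def sample(model, n, rng):
    """Pick a cluster by weight, then draw each variable independently."""
    weights = np.array([w for w, _ in model])
    choices = rng.choice(len(model), size=n, p=weights)
    out = np.empty((n, len(model[0][1])))
    for c in range(len(model)):
        rows = np.flatnonzero(choices == c)
        for j, (counts, edges) in enumerate(model[c][1]):
            b = rng.choice(len(counts), size=len(rows), p=counts / counts.sum())
            # Draw uniformly inside the chosen histogram bin.
            out[rows, j] = rng.uniform(edges[b], edges[b + 1])
    return out

rng = np.random.default_rng(0)
cov = np.array([[1.0, 0.3], [0.3, 1.0]])
original = rng.multivariate_normal([0.0, 0.0], cov, size=10_000)

labels = kmeans_labels(original, k=6, rng=rng)
model = fit_mixture_of_histograms(original, labels)
synthetic = sample(model, n=10_000, rng=rng)

print("original correlation :", np.corrcoef(original, rowvar=False)[0, 1])
print("synthetic correlation:", np.corrcoef(synthetic, rowvar=False)[0, 1])
```

Because variables are treated as independent within each cluster, any dependence in the synthetic data comes only from differences between clusters. The number of clusters and the number of histogram bins are therefore the kind of user choices that would be expected to influence how closely the synthetic data follow the original.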
