NPJ Digit Med. 2025 Jan 23;8(1):49.
doi: 10.1038/s41746-025-01431-6.

Preserving information while respecting privacy through an information theoretic framework for synthetic health data generation

Nadir Sella et al. NPJ Digit Med.

Abstract

Generating synthetic data from medical records is a complex task, made harder by patient privacy concerns. In recent years, multiple approaches to synthetic data generation have been reported; however, limited attention has been given to jointly evaluating the quality and the privacy of the generated data. The quality and privacy of synthetic data stem from multivariate associations across variables, which cannot be assessed by comparing univariate distributions with the original data. Here, we introduce a novel algorithm (MIIC-SDG) for generating synthetic data from electronic records based on a multivariate information framework and Bayesian network theory. We also propose a new metric to quantitatively assess the trade-off between the Quality and Privacy Scores (QPS) of synthetic data generation methods. The performance of MIIC-SDG is demonstrated on different clinical datasets and compares favorably with state-of-the-art synthetic data generation methods, based on the QPS trade-off between several quality and privacy metrics.

Conflict of interest statement

Competing interests: The authors declare no competing interests.

Figures

Fig. 1. MIIC-SDG pipeline.
This illustration shows the complete data generation process, from the original data table to the generated data, following the three main steps described in the “MIIC network reconstruction” section of Methods. a Execution of the MIIC algorithm on the original data table. This step generates a graph where nodes represent the variables of the data matrix and edges represent direct associations between variables. b Transformation of the graph into a directed acyclic graph (DAG) through the MIIC-to-DAG algorithm. c Generation of the data using the original data table and the reconstructed DAG. d Details of the data generation step: each scenario takes into account the variable types of the target and parent nodes to adapt the sampling procedure (branches (a–f) are described in the “MIIC-synthetizer” section of Methods). RF: Random Forest; Cond. Prob. Table: Conditional Probability Table; Prob. Table: Probability Table; Emp. Dens.: Empirical density.
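To make step c concrete, the following is a minimal sketch of ancestral sampling from a DAG, restricted to the simplest scenario in which the target and all of its parents are categorical and a conditional probability table is used. It is not the authors' implementation: the function names `fit_cpts` and `sample_rows`, the fallback to marginal frequencies for unseen parent configurations, and the toy data are all assumptions made for illustration. The other branches shown in panel d (Random Forest, empirical density) are not covered.

```python
import random
from collections import Counter, defaultdict
from graphlib import TopologicalSorter  # Python 3.9+


def fit_cpts(rows, parents):
    """Count, for each variable, its values given each observed parent configuration.

    rows: list of dicts (one dict per record); parents: {variable: [parent names]}.
    Returns (cpts, marginals), both built from raw frequency counts.
    """
    cpts = {var: defaultdict(Counter) for var in parents}
    marginals = {var: Counter() for var in parents}
    for row in rows:
        for var, pa in parents.items():
            key = tuple(row[p] for p in pa)
            cpts[var][key][row[var]] += 1
            marginals[var][row[var]] += 1
    return cpts, marginals


def sample_rows(rows, parents, n, seed=0):
    """Draw n synthetic records by sampling each node after its parents."""
    rng = random.Random(seed)
    cpts, marginals = fit_cpts(rows, parents)
    # TopologicalSorter takes {node: predecessors}, so parents come first.
    order = list(TopologicalSorter(parents).static_order())
    synthetic = []
    for _ in range(n):
        new = {}
        for var in order:
            key = tuple(new[p] for p in parents[var])
            # Assumption: fall back to the marginal counts when this parent
            # configuration never occurred in the original table.
            counts = cpts[var].get(key) or marginals[var]
            values = list(counts)
            weights = [counts[v] for v in values]
            new[var] = rng.choices(values, weights=weights)[0]
        synthetic.append(new)
    return synthetic


# Toy example (hypothetical data): "cough" depends on "smoker".
toy_rows = [{"smoker": "yes", "cough": "yes"}] * 3 + [{"smoker": "no", "cough": "no"}] * 5
toy_parents = {"smoker": [], "cough": ["smoker"]}
toy_synthetic = sample_rows(toy_rows, toy_parents, n=5, seed=1)
```

Sampling in topological order guarantees that every parent value is available before its child is drawn, which is why the graph must be acyclic before generation starts.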
Fig. 2. Network reconstructed by MIIC-to-DAG starting from the network obtained by MIIC from the full METABRIC dataset.
The network is learned with the MIIC-to-DAG algorithm (step B of MIIC-SDG), starting from the network obtained by MIIC (Supplementary Fig. 1) with the parameters defined in Supplementary Table 1. This network corresponds to the directed acyclic graph used to generate the synthetic data in step C of MIIC-SDG. The network can be viewed at https://miic.curie.fr/job_results_NL.php?id=METABRIC_DAG.
Fig. 3. Correlation matrices evaluated on 1000 samples for the METABRIC dataset.
Correlation for each x, y combination is evaluated as the mean value over all executions with the same sample size.
Fig. 4. Feature permutation importance to predict overall survival.
We used a Survival Random Forest model fitted on a set of 1977 patients from the METABRIC dataset.
Fig. 5. K-Fold Cross-validated c-index estimates.
This analysis used a Survival Random Forest model to predict Overall Survival in the METABRIC dataset as a function of sample size (K = 5).
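For readers unfamiliar with the c-index reported in this figure, here is a minimal sketch of Harrell's concordance index for right-censored survival data. It assumes risk scores are already available (the Survival Random Forest itself is not reproduced), skips pairs with tied survival times for simplicity, and uses a hypothetical function name, `concordance_index`. In a K-fold setting, one would compute this value on each held-out fold and average the K results.

```python
def concordance_index(times, events, risks):
    """Harrell's c-index: fraction of comparable pairs ranked correctly.

    times:  observed follow-up times
    events: 1 if the event was observed, 0 if the record is censored
    risks:  predicted risk scores (higher means shorter expected survival)
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when the record with the shorter time
            # had an observed event; ties in time are skipped here.
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy example (hypothetical values): risks perfectly ordered against time.
toy_c = concordance_index([1, 2, 3, 4], [1, 1, 0, 1], [4, 3, 2, 1])
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking of survival times.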
Fig. 6. Quality, privacy and quality-privacy scores (QPS).
This comparison uses Mutual Information distance as the quality measure and identifiability and membership inference scores as the privacy measures.
Fig. 7. Quality, privacy and quality-privacy scores (QPS) on IMvigor210.
This comparison uses Mutual Information distance as the quality measure and identifiability and membership inference scores as the privacy measures. Data are from the IMvigor210 trial (bladder cancer).
Fig. 8. Meta-QPS scores of each dataset (IMvigor210, METABRIC and Diabetes).
These plots summarize the benchmark results by integrating all quality and privacy scores into single metaQPS metrics for each dataset and all sample sizes analyzed in this study. MetaQPS.am corresponds to the F1-score between the arithmetic mean of quality scores and the arithmetic mean of the privacy scores. MetaQPS.hm corresponds to the F1-score between the harmonic mean of quality scores and the harmonic mean of the privacy scores (see details in Methods).
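The two summary metrics described in this caption can be sketched as follows. This is an illustration only: it assumes that "F1-score between" two quantities means their harmonic mean, 2qp / (q + p), and that all quality and privacy scores lie in [0, 1]. The exact definitions are given in the paper's Methods, which are not part of this record, and the function names are hypothetical.

```python
from statistics import fmean, harmonic_mean


def f1(a, b):
    """Harmonic mean of two non-negative numbers (0 if both are 0)."""
    return 0.0 if a + b == 0 else 2 * a * b / (a + b)


def meta_qps_am(quality_scores, privacy_scores):
    """F1 between the arithmetic means of the quality and privacy scores."""
    return f1(fmean(quality_scores), fmean(privacy_scores))


def meta_qps_hm(quality_scores, privacy_scores):
    """F1 between the harmonic means of the quality and privacy scores."""
    return f1(harmonic_mean(quality_scores), harmonic_mean(privacy_scores))


# Toy example (hypothetical scores, not taken from the paper).
quality = [0.8, 0.6]
privacy = [0.5, 0.9]
am = meta_qps_am(quality, privacy)
hm = meta_qps_hm(quality, privacy)
```

Because a harmonic mean is pulled toward its smallest input, the harmonic-mean variant penalizes a method more strongly when any single quality or privacy score is low.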

