Patient-centric synthetic data generation, no reason to risk re-identification in biomedical data analysis

Affiliations

¹ Octopize, Mimethik Data, Nantes, France.
² Nantes Université, INSERM, CHU de Nantes, Ecole Centrale de Nantes, Centre de Recherche Translationnelle en Transplantation et Immunologie, CR2TI, Nantes, France.
³ Nantes Université, CHU de Nantes, INSERM, CIC 1413, Pôle Hospitalo-Universitaire 11: Santé Publique, Clinique des données, Nantes, France.
⁴ Nantes Université, INSERM, CHU de Nantes, Ecole Centrale de Nantes, Centre de Recherche Translationnelle en Transplantation et Immunologie, CR2TI, Nantes, France. pierre-antoine.gourraud@univ-nantes.fr.
⁵ Nantes Université, CHU de Nantes, INSERM, CIC 1413, Pôle Hospitalo-Universitaire 11: Santé Publique, Clinique des données, Nantes, France. pierre-antoine.gourraud@univ-nantes.fr.

PMID: 36899082
PMCID: PMC10006164
DOI: 10.1038/s41746-023-00771-5

Morgan Guillaudeux et al. NPJ Digit Med. 2023 Mar 10;6(1):37. doi: 10.1038/s41746-023-00771-5.

Abstract

While nearly all computational methods operate on pseudonymized personal data, re-identification remains a risk. With personal health data, this re-identification risk may be considered a double-crossing of patients' trust. Herein, we present a new method to generate synthetic data of individual granularity while preserving patients' privacy. Developed for sensitive biomedical data, the method is patient-centric: it uses a local model to generate random new synthetic data, called "avatar data", for each original sensitive individual. The method is compared with two other synthetic data generation techniques (Synthpop, CT-GAN) and applied to real health data from a clinical trial and a cancer observational study to evaluate the protection it provides while retaining the original statistical information. Compared with Synthpop and CT-GAN, the Avatar method shows a similar level of signal maintenance while allowing additional privacy metrics to be computed. In light of distance-based privacy metrics, each individual produces an avatar simulation that is on average indistinguishable from 12 other generated avatar simulations for the clinical trial and 24 for the observational study. Data transformation using the Avatar method preserves both the evaluation of the treatment's effectiveness, with similar hazard ratios for the clinical trial (original HR = 0.49 [95% CI, 0.39-0.63] vs. avatar HR = 0.40 [95% CI, 0.31-0.52]), and the classification properties for the observational study (original AUC = 99.46 (s.e. 0.25) vs. avatar AUC = 99.84 (s.e. 0.12)). Once validated by privacy metrics, anonymous synthetic data enable the creation of value from sensitive pseudonymized data analyses by tackling the risk of a privacy breach.

Conflict of interest statement

M.G., J.P., C.A.D. are employees of Octopize. Z.B. is a former trainee at Octopize. P.A.G. is the founder of Methodomics (2008) and the co-founder of Big data Santé (2018). He consults for major pharmaceutical companies, and start-ups, all of which are handled through academic pipelines (AstraZeneca, Biogen, Boston Scientific, Cook, Docaposte, Edimark, Ellipses, Elsevier, Methodomics, Merck, Mérieux, Octopize, Sanofi-Genzyme). P.A.G. is a volunteer board member at AXA not-for-profit mutual insurance company (2021). He has no prescription activity with either drugs or devices.

Figures

**Fig. 1. Comparative results of analyses based on original and avatar data.**
a, b FAMD projections of the a AIDS (k = 20) and b WBCD (k = 20) avatar data in the original data space (original data as orange dots, avatar data as green dots). Avatar and original data are overlaid and share the same space built from the original observations. c Distributions of times to events were estimated using the Kaplan-Meier estimate of the time-to-event function and compared with the log-rank test and the Cox proportional-hazards model, with a comparison between the original (plain lines) and AIDS avatar data (dotted lines) for arms 0 (purple lines) and 1 (red lines). The statistical p-values are computed using the Wald test. The original and avatar WBCD datasets were split into 70% training and 30% test sets (100 times). d Comparison of original (orange bars) and avatar (green bars) F-scores for each variable. Error bars represent the 95% confidence interval. SVM machine-learning models were fitted using five features selected by F-score. The AUC is presented for the original and avatar datasets. Supplementary Tables 1 and 2 show additional statistics. FAMD factor analysis for mixed data, AUC area under the ROC curve, SVM support vector machine, CI confidence interval, HR hazard ratio.
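The Kaplan-Meier estimate mentioned in panel c can be illustrated with a minimal NumPy sketch. This is not the authors' code: the function name `kaplan_meier` and the toy inputs are hypothetical, and a real analysis would normally rely on a survival-analysis library. Running the same function on the original and on the avatar dataset gives the two curves that the figure overlays.

```python
import numpy as np

def kaplan_meier(times, events):
    """Return (event_times, survival) for right-censored data.

    times  : follow-up time of each individual
    events : 1/True if the event was observed, 0/False if censored
    """
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=bool)
    event_times = np.unique(times[events])  # sorted distinct event times
    survival = []
    s = 1.0
    for t in event_times:
        at_risk = np.sum(times >= t)             # still under observation at t
        observed = np.sum((times == t) & events)  # events occurring exactly at t
        s *= 1.0 - observed / at_risk            # product-limit update
        survival.append(s)
    return event_times, np.array(survival)

# Toy, made-up data: five individuals, two of them censored.
t, s = kaplan_meier([1, 2, 2, 3, 4], [1, 1, 0, 1, 0])
```

The estimator only steps down at observed event times; censored individuals leave the risk set without lowering the curve.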

**Fig. 2. Comparative results of utility and privacy for original, avatar, Synthpop, and CT-GAN data.**
a Hazard ratio between arm 0 and arm 1 per synthetic data generation method (Avatar method: green, Synthpop: purple, CT-GAN: blue) with the original reference (orange). Error bars represent the 95% confidence interval. b Boxplot comparison of F-scores obtained in SVM models per variable and per synthetic data generation method (Avatar method: green, Synthpop: purple, CT-GAN: blue) over 100 iterations with the original reference (orange). Boxplots present the median, first, and third quartiles. The minimum whisker equals Q1 − 1.5 × IQR and the maximum equals Q3 + 1.5 × IQR. c, d Summary tables for c AIDS and d WBCD of DCR and NNDR median values and quantiles (0.05–0.95) for the three synthetic data generation methods. The original reference is obtained by applying both metrics to a 70% sample of the original data and the 30% holdout of the original data. AUC area under the ROC curve, CI confidence interval, Q1 first quartile, Q3 third quartile, IQR interquartile range, DCR distance to the closest record, NNDR nearest neighbor distance ratio, q0.05 5th percentile, q0.95 95th percentile, SVM support vector machine, CT-GAN conditional tabular generative adversarial network.
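The two distance-based metrics summarized in panels c and d can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: it assumes purely numeric, already-scaled features and Euclidean distance, and the helper names `dcr_nndr` and `summarize` are hypothetical. DCR is taken as each synthetic record's distance to its closest original record, and NNDR as the ratio between the distances to its closest and second-closest original records.

```python
import numpy as np

def dcr_nndr(original, synthetic):
    """Distance to closest record and nearest neighbor distance ratio."""
    original = np.asarray(original, dtype=float)
    synthetic = np.asarray(synthetic, dtype=float)
    # Pairwise Euclidean distances, shape (n_synthetic, n_original).
    d = np.linalg.norm(synthetic[:, None, :] - original[None, :, :], axis=2)
    two_closest = np.sort(d, axis=1)[:, :2]
    dcr = two_closest[:, 0]
    second = two_closest[:, 1]
    # Assumption: when the second-closest distance is 0 the ratio is set to 1.
    nndr = np.divide(dcr, second, out=np.ones_like(dcr), where=second > 0)
    return dcr, nndr

def summarize(values):
    """Median with 5th and 95th percentiles, as reported in the caption."""
    return {
        "median": float(np.median(values)),
        "q0.05": float(np.quantile(values, 0.05)),
        "q0.95": float(np.quantile(values, 0.95)),
    }

# Toy, made-up records: three originals and two synthetic rows.
orig = [[0.0, 0.0], [3.0, 0.0], [0.0, 4.0]]
synth = [[0.0, 1.0], [3.0, 0.0]]
dcr, nndr = dcr_nndr(orig, synth)
```

A DCR of zero means a synthetic row exactly copies an original record, while an NNDR close to 1 means the synthetic row sits about as close to its second-nearest original as to its nearest one.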

**Fig. 3. Quantification of re-identification risk of sensitive data using the avatar dataset.**
The risk of re-identifying an individual in the avatar dataset is near zero. a, b Distribution of the local cloaking for a AIDS (hidden rate: 93%, median = 11) and b WBCD (hidden rate: 94%, median = 24). c, d show the histograms of individuals according to the number of times they had a local cloaking of zero for the c AIDS and d WBCD datasets. In both cases, the experiment was conducted on 25 independent avatar simulations (k = 20).
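The caption does not define local cloaking or the hidden rate, so the sketch below states its assumptions explicitly. It takes local cloaking to be the number of avatars lying strictly closer to an individual than that individual's own avatar, and the hidden rate to be the share of individuals whose nearest avatar is not their own (local cloaking greater than zero). The function name `local_cloaking` is hypothetical, and row i of the avatar array is assumed to have been generated from row i of the original array, both expressed in the same projection space.

```python
import numpy as np

def local_cloaking(z_original, z_avatar):
    """Return (cloaking per individual, hidden rate) under the assumptions above."""
    z_original = np.asarray(z_original, dtype=float)
    z_avatar = np.asarray(z_avatar, dtype=float)
    # d[i, j] = distance from original individual i to avatar j.
    d = np.linalg.norm(z_original[:, None, :] - z_avatar[None, :, :], axis=2)
    own = np.diag(d)  # distance from each individual to its own avatar
    # Count avatars strictly closer to individual i than its own avatar.
    cloaking = (d < own[:, None]).sum(axis=1)
    hidden_rate = float((cloaking > 0).mean())
    return cloaking, hidden_rate

# Toy, made-up one-dimensional coordinates.
cloaking, hidden_rate = local_cloaking([[0.0], [1.0], [5.0]],
                                       [[2.0], [0.9], [5.1]])
```

Under these assumptions, a local cloaking of zero is the risky case counted in panels c and d: the closest avatar to an individual is the one generated from that individual.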

**Fig. 4. Influence of k on statistical relevance and re-identification risk.**
High k values reduce the preservation of the dataset's statistical information while enhancing privacy. a, b FAMD projections of a two AIDS avatar simulations with k = 4 (light green dots) and k = 1166 (dark green dots) and b two WBCD avatar simulations with k = 4 (light green dots) and k = 342 (dark green dots) in their original data FAMD projection space. Unlike Fig. 1a, b, Fig. 4a, b present avatar data only. c Hazard ratio evolution for arm 1 compared with arm 0 as a function of k. The green zone represents the 95% CI of the hazard ratio mean. The orange line represents the original data results. d Accuracy evolution as a function of k. For each k, SVM models were computed on 10 train/test datasets (70/30). Green zones represent the 95% CI. Orange lines and associated areas represent the original data AUC mean and associated 95% CI. e, f Influence of a high k on data privacy: comparison of the local cloaking distribution (base-10 log scale) for low to high k. Boxplots present the median, first, and third quartiles. FAMD factor analysis for mixed data, AUC area under the ROC curve, SVM support vector machine, CI confidence interval.

**Fig. 5. The Avatar method uses local modeling to stochastically generate a synthetic individual, termed an avatar simulation.**
**(1)** Original pseudonymized sensitive data. **(2)** The core of the Avatar method consists of four steps: (a) individuals are projected in a multidimensional space; (b) pairwise distances are computed to find the k nearest neighbors (here k = 12) in a reduced space; (c) a synthetic individual is pseudo-randomly generated in the subspace defined by the neighbors; (d) privacy metrics are evaluated. **(3)** Output of the dataset of synthetic data. More details are provided online (https://docs.octopize.io/).
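Steps (a) to (c) can be sketched in a few lines of NumPy. This is a simplified illustration, not the Avatar implementation: it assumes numeric data only, swaps in PCA (via SVD) for the projection step, and draws the neighbor weights from a flat Dirichlet distribution, which is an assumption made here for brevity. The function name `avatar_sketch` and all parameters are hypothetical. Step (d), the privacy evaluation, is left out.

```python
import numpy as np

def avatar_sketch(X, k=12, n_components=2, seed=0):
    """Generate one synthetic row per original row from its k nearest neighbors."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)

    # (a) Project individuals into a reduced space (PCA stands in for FAMD).
    mean = X.mean(axis=0)
    centered = X - mean
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]
    Z = centered @ components.T

    # (b) Pairwise distances and k nearest neighbors, excluding the individual itself.
    d = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    neighbors = np.argsort(d, axis=1)[:, :k]

    # (c) Random convex combination of the neighbors (assumed weighting scheme).
    weights = rng.dirichlet(np.ones(k), size=len(Z))        # shape (n, k)
    Z_avatar = np.einsum("ik,ikd->id", weights, Z[neighbors])

    # Map the synthetic points back to the original variable space.
    X_avatar = Z_avatar @ components + mean
    return X_avatar, Z, Z_avatar

# Toy, made-up data: 50 individuals with 4 numeric variables.
X = np.random.default_rng(1).normal(size=(50, 4))
X_avatar, Z, Z_avatar = avatar_sketch(X, k=12)
```

Because each synthetic point is a convex combination of neighbors, it stays inside the local region they span. A larger k widens that region, which is the privacy versus utility trade-off that Fig. 4 examines.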
