Patient-centric synthetic data generation, no reason to risk re-identification in biomedical data analysis
- PMID: 36899082
- PMCID: PMC10006164
- DOI: 10.1038/s41746-023-00771-5
Abstract
While nearly all computational methods operate on pseudonymized personal data, re-identification remains a risk. With personal health data, this re-identification risk may be considered a betrayal of patients' trust. Herein, we present a new method to generate synthetic data of individual granularity while preserving patients' privacy. Developed for sensitive biomedical data, the method is patient-centric: it uses a local model to generate random new synthetic data, called "avatar data", for each initial sensitive individual. This method is compared with two other synthetic data generation techniques (Synthpop, CT-GAN) and applied to real health data from a clinical trial and a cancer observational study, to evaluate the protection it provides while retaining the original statistical information. Compared to Synthpop and CT-GAN, the Avatar method shows a similar level of signal maintenance while allowing additional privacy metrics to be computed. In light of distance-based privacy metrics, each individual produces an avatar simulation that is on average indistinguishable from 12 other generated avatar simulations for the clinical trial and 24 for the observational study. Data transformation using the Avatar method preserves both the evaluation of the treatment's effectiveness, with similar hazard ratios for the clinical trial (original HR = 0.49 [95% CI, 0.39-0.63] vs. avatar HR = 0.40 [95% CI, 0.31-0.52]), and the classification properties for the observational study (original AUC = 99.46 (s.e. 0.25) vs. avatar AUC = 99.84 (s.e. 0.12)). Once validated by privacy metrics, anonymous synthetic data enable the creation of value from sensitive pseudonymized data analyses by tackling the risk of a privacy breach.
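The abstract describes a local, per-individual generation step and distance-based privacy metrics without giving their details. The sketch below is only an illustration of that general idea, not the authors' implementation. It assumes purely numeric data, a k-nearest-neighbour neighbourhood as the "local model", and random exponential weights; the function names `make_avatars` and `closer_avatar_counts`, the parameter `k`, and the counting rule are all hypothetical choices made for this example.

```python
import numpy as np


def make_avatars(X, k=5, rng=None):
    """Illustrative local generator (assumption, not the published method).

    Each individual gets one synthetic record built as a random convex
    combination of its k nearest neighbours in Euclidean space.
    """
    rng = np.random.default_rng(rng)
    X = np.asarray(X, dtype=float)
    n = X.shape[0]

    # Pairwise Euclidean distances between all individuals.
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)  # an individual is not its own neighbour

    # Indices of the k closest other individuals, per row.
    neighbours = np.argsort(dist, axis=1)[:, :k]

    # Random positive weights, normalised to sum to 1 for each individual.
    weights = rng.exponential(size=(n, k))
    weights /= weights.sum(axis=1, keepdims=True)

    # Weighted average of the neighbours' records.
    return np.einsum("ik,ikd->id", weights, X[neighbours])


def closer_avatar_counts(X, avatars):
    """Hypothetical distance-based check.

    For each original individual, count how many avatars generated from
    other individuals lie strictly closer to it than its own avatar does.
    """
    X = np.asarray(X, dtype=float)
    avatars = np.asarray(avatars, dtype=float)
    diff = X[:, None, :] - avatars[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))  # dist[i, j]: original i to avatar j
    own = np.diag(dist)                      # distance to one's own avatar
    return (dist < own[:, None]).sum(axis=1)
```

Under these assumptions, a larger average count means an individual's own avatar is harder to single out among the other avatars by distance alone, which is the intuition behind the "indistinguishable from 12 other generated avatar simulations" statement in the abstract.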
© 2023. The Author(s).
Conflict of interest statement
M.G., J.P., and C.A.D. are employees of Octopize. Z.B. is a former trainee at Octopize. P.A.G. is the founder of Methodomics (2008) and the co-founder of Big data Santé (2018). He consults for major pharmaceutical companies and start-ups, all of which are handled through academic pipelines (AstraZeneca, Biogen, Boston Scientific, Cook, Docaposte, Edimark, Ellipses, Elsevier, Methodomics, Merck, Mérieux, Octopize, Sanofi-Genzyme). P.A.G. is a volunteer board member at the AXA not-for-profit mutual insurance company (2021). He has no prescription activity with either drugs or devices.
