Evaluating the Utility and Privacy of Synthetic Breast Cancer Clinical Trial Data Sets
- PMID: 38011617
- PMCID: PMC10703127
- DOI: 10.1200/CCI.23.00116
Abstract
Purpose: There is strong interest from patients, researchers, the pharmaceutical industry, medical journal editors, funders of research, and regulators in sharing clinical trial data for secondary analysis. However, data access remains a challenge because of concerns about patient privacy. It has been argued that synthetic data generation (SDG) is an effective way to address these privacy concerns. There is a dearth of evidence supporting this on oncology clinical trial data sets, and on the utility of privacy-preserving synthetic data. The objective of the proposed study is to validate the utility and privacy risks of synthetic clinical trial data sets across multiple SDG techniques.
Methods: We synthesized data sets from eight breast cancer clinical trial data sets using three types of generative models: sequential synthesis, conditional generative adversarial network, and variational autoencoder. Synthetic data utility was evaluated by replicating the published analyses on the synthetic data and assessing concordance of effect estimates and CIs between real and synthetic data. Privacy was evaluated by measuring attribution disclosure risk and membership disclosure risk.
Results: Utility was highest with the sequential synthesis method, for which all results were replicable and the CI overlap was the most similar or higher for seven of eight data sets. Both types of privacy risk were low across all three types of generative models.
Discussion: Synthetic data using sequential synthesis methods can act as a proxy for real clinical trial data sets, and simultaneously have low privacy risks. This type of generative model can be one way to enable broader sharing of clinical trial data.
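The abstract reports CI overlap between real and synthetic analyses but does not define the metric. The sketch below shows one commonly used definition, in which the shared length of the two intervals is expressed as a fraction of each interval's width and the two fractions are averaged. It is an illustration under that assumption, not necessarily the computation the authors used; the function name `ci_overlap` and the example interval values are hypothetical.

```python
def ci_overlap(real_ci, synth_ci):
    """Average fractional overlap of two confidence intervals.

    Assumed definition (not taken from the abstract): the length of the
    intersection divided by each interval's width, averaged over the two
    intervals. Disjoint intervals are scored as 0.0 here.
    """
    lo_r, hi_r = real_ci
    lo_s, hi_s = synth_ci
    if hi_r <= lo_r or hi_s <= lo_s:
        raise ValueError("each interval must satisfy lower < upper")

    # Length of the region covered by both intervals.
    shared = min(hi_r, hi_s) - max(lo_r, lo_s)
    if shared <= 0:
        return 0.0

    # Fraction of each interval that is shared, then averaged.
    return 0.5 * (shared / (hi_r - lo_r) + shared / (hi_s - lo_s))


# Hypothetical hazard-ratio intervals, for illustration only.
real = (0.60, 0.90)
synthetic = (0.65, 1.00)
print(round(ci_overlap(real, synthetic), 3))
```

Under this definition a value of 1.0 means the two intervals coincide exactly, and values near 0 mean an analysis on the synthetic data would give a materially different interval from the one obtained on the real data.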
Conflict of interest statement
The following represents disclosure information provided by authors of this manuscript. All relationships are considered compensated unless otherwise noted. Relationships are self-held unless noted. I = Immediate Family Member, Inst = My Institution. Relationships may not relate to the subject matter of this manuscript.
No other potential conflicts of interest were reported.