Review
Comput Struct Biotechnol J. 2024 Jul 9;23:2892-2910.
doi: 10.1016/j.csbj.2024.07.005. eCollection 2024 Dec.

Synthetic data generation methods in healthcare: A review on open-source tools and methods

Vasileios C Pezoulas et al. Comput Struct Biotechnol J.

Abstract

Synthetic data generation has emerged as a promising solution to the challenges posed by data scarcity and privacy concerns, and to the need for training artificial intelligence (AI) algorithms on unbiased data with sufficient sample size and statistical power. Our review explores the application and efficacy of synthetic data methods in healthcare, considering the diversity of medical data. To this end, we systematically searched the PubMed and Scopus databases, focusing on tabular, imaging, radiomics, time-series, and omics data. Studies involving multimodal synthetic data generation were also explored. The synthetic data generation method used in each study was identified and categorized as statistical, probabilistic, machine learning, or deep learning. Emphasis was given to the programming languages used to implement each method. Our evaluation revealed that the majority of the studies use synthetic data generators to: (i) reduce the cost and time required for clinical trials for rare diseases and conditions, (ii) enhance the predictive power of AI models in personalized medicine, (iii) ensure the delivery of fair treatment recommendations across diverse patient populations, and (iv) enable researchers to access high-quality, representative multimodal datasets without exposing sensitive patient information, among others. We underline the wide use of deep learning-based synthetic data generators, found in 72.6% of the included studies, with 75.3% of the generators implemented in Python. Finally, thorough documentation of open-source repositories is provided to accelerate research in the field.

Keywords: Artificial intelligence; Data privacy; Healthcare; Imaging data; Multimodal data; Omics data; Radiomics data; Synthetic data generation; Tabular data; Time-series data.
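
As a rough illustration of the simplest of the four method categories named in the abstract (statistical), the sketch below fits a bivariate Gaussian to a tiny two-column table and samples synthetic rows from it. It is not taken from the review or from any tool it documents: the function names, the toy (age, systolic blood pressure) values, and the Gaussian assumption itself are all hypothetical choices made for illustration only.

```python
import math
import random
import statistics

# Toy "real" table: (age in years, systolic blood pressure in mmHg).
# These values are invented for illustration; they are not patient data.
toy_rows = [(34, 118), (45, 125), (52, 131), (61, 140),
            (29, 115), (70, 148), (48, 128), (56, 135)]


def fit_bivariate_gaussian(rows):
    """Estimate means, standard deviations, and Pearson correlation of two columns."""
    xs = [r[0] for r in rows]
    ys = [r[1] for r in rows]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sx, sy = statistics.stdev(xs), statistics.stdev(ys)
    # Sample covariance with the same (n - 1) denominator as statistics.stdev.
    cov = sum((x - mx) * (y - my) for x, y in rows) / (len(rows) - 1)
    return {"mx": mx, "my": my, "sx": sx, "sy": sy, "rho": cov / (sx * sy)}


def sample_synthetic(params, n, seed=0):
    """Draw n synthetic rows from the fitted bivariate Gaussian."""
    rng = random.Random(seed)
    rho = params["rho"]
    # Guard against tiny negative values caused by floating-point rounding.
    residual = math.sqrt(max(0.0, 1.0 - rho ** 2))
    rows = []
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rng.gauss(0.0, 1.0)
        x = params["mx"] + params["sx"] * z1
        # Mixing z1 into the second column reproduces the fitted correlation.
        y = params["my"] + params["sy"] * (rho * z1 + residual * z2)
        rows.append((x, y))
    return rows


params = fit_bivariate_gaussian(toy_rows)
for age, sbp in sample_synthetic(params, n=5):
    print(f"age={age:.1f}, systolic_bp={sbp:.1f}")
```

A generator like this preserves only the means, variances, and linear correlation of the source table, and it offers no formal privacy guarantee. The probabilistic, machine learning, and deep learning generators surveyed in the review aim to capture richer structure.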

Conflict of interest statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Figures

Graphical abstract
Fig. 1. PRISMA flowchart for the systematic review including the database searches, the number of abstracts screened, and the full texts retrieved.
Fig. 2. The four stages of the synthetic data generation workflow.
Fig. 3. An overview of: (A) the total number of synthetic data generation studies in healthcare per year by PubMed and Scopus, and (B) the final number of studies across different data types (five main data types and multimodal data cases) by PubMed and Scopus.
Fig. 4. Overview of methods and programming languages used for synthetic data generation in healthcare: (A) Types of methods used in the studies, (B) Programming languages used for the implementation.

