Int J Popul Data Sci. 2023 Oct 31;8(1):2158.
doi: 10.23889/ijpds.v8i1.2158. eCollection 2023.

Federated learning for generating synthetic data: a scoping review

Claire Little et al. Int J Popul Data Sci.

Abstract

Introduction: Federated Learning (FL) is a decentralised approach to training statistical models, where training is performed across multiple clients, producing one global model. Since the training data remains with each local client and is not shared or exchanged with other clients, the use of FL may reduce privacy and security risks (compared to methods where multiple data sources are pooled) and can also address data access and heterogeneity problems. Synthetic data is artificially generated data that has the same structure and statistical properties as the original but that does not contain any of the original data records, therefore minimising disclosure risk. Using FL to produce synthetic data (which we refer to as "federated synthesis") has the potential to combine data from multiple clients without compromising privacy, allowing access to data that may otherwise be inaccessible in its raw format.
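
As an illustration of the decentralised training described above, the following is a minimal sketch of federated averaging, one common FL aggregation scheme. It is not a method taken from the reviewed articles (which predominantly train GANs); the toy one-parameter linear model, the function names `local_update` and `federated_average`, and the example data are all assumptions made for illustration. Each client trains on its own records, and only the model weight is sent to the server.

```python
from typing import List, Tuple

# Each client holds its own (x, y) records; these never leave the client.
Client = List[Tuple[float, float]]


def local_update(w: float, data: Client, lr: float = 0.01, epochs: int = 5) -> float:
    """Run a few gradient-descent steps on one client's data for the model y ~ w * x."""
    for _ in range(epochs):
        # Gradient of the mean squared error with respect to w.
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w


def federated_average(clients: List[Client], rounds: int = 50) -> float:
    """Hypothetical server loop: broadcast the global weight, collect local
    weights, and average them weighted by each client's record count."""
    w_global = 0.0
    total = sum(len(c) for c in clients)
    for _ in range(rounds):
        local_ws = [local_update(w_global, c) for c in clients]
        w_global = sum(len(c) * w for c, w in zip(clients, local_ws)) / total
    return w_global


# Illustrative data only: two clients whose records both follow y = 2x.
clients = [
    [(1.0, 2.0), (2.0, 4.0)],
    [(3.0, 6.0), (4.0, 8.0), (5.0, 10.0)],
]
w = federated_average(clients)
print(f"global weight: {w:.4f}")
```

In this sketch the server only ever sees model weights, never the client records, which is the property that motivates using FL when data cannot be pooled. Sharing weights alone does not guarantee privacy, so this should be read as a structural illustration rather than a privacy-preserving implementation.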

Objectives: The objective was to review current research and practices for using FL to generate synthetic data and determine the extent to which research has been undertaken, the methods and evaluation practices used, and any research gaps.

Methods: A scoping review was conducted to systematically map and describe the published literature on the use of FL to generate synthetic data. Relevant studies were identified through online databases and the findings are described, grouped, and summarised. Information extracted included article characteristics, documenting the type of data that is synthesised, the model architecture and the methods (if any) used to evaluate utility and privacy risk.

Results: A total of 69 articles were included in the scoping review; all were published between 2018 and 2023 with two thirds (46) in 2022. 30% (21) were focussed on synthetic data generation as the main model output (with 6 of these generating tabular data), whereas 59% (41) focussed on data augmentation. Of the 21 performing federated synthesis, all used deep learning methods (predominantly Generative Adversarial Networks) to generate the synthetic data.

Conclusions: Federated synthesis is in its early days but shows promise as a method that can construct a global synthetic dataset without sharing any of the local client data. As a field in its infancy there are areas to explore in terms of the privacy risk associated with the various methods proposed, and more generally in how we measure those risks.

Keywords: data confidentiality; data utility; federated learning; review; synthetic data.

Conflict of interest statement

Statement on conflicts of interest: The authors declare that they have no conflicts to report.

Figures

Figure 1: Structure of a typical GAN (Generative Adversarial Network)
Figure 2: PRISMA flow diagram for the literature database and website search
Figure 3: Characteristics of included papers (n = 69)
Figure 4: The type of data used by each paper, by the goal of the research
Figure 5: The different GAN (Generative Adversarial Network) configurations used on the server and client devices, for federated synthesis
