Int J Popul Data Sci. 2023 Oct 31;8(1):2158.

doi: 10.23889/ijpds.v8i1.2158. eCollection 2023.

Federated learning for generating synthetic data: a scoping review

Claire Little¹, Mark Elliot², Richard Allmendinger³

Affiliations

¹ Cathie Marsh Institute for Social Research, School of Social Sciences, University of Manchester, Oxford Road, M13 9PL, Manchester, UK.
² Department of Social Statistics, School of Social Sciences, University of Manchester, Oxford Road, M13 9PL, Manchester, UK.
³ Alliance Manchester Business School, University of Manchester, Oxford Road, M13 9PL, Manchester, UK.

PMID: 38414544
PMCID: PMC10898505
DOI: 10.23889/ijpds.v8i1.2158

Claire Little et al. Int J Popul Data Sci. 2023.

Abstract

Introduction: Federated Learning (FL) is a decentralised approach to training statistical models in which training is performed across multiple clients to produce one global model. Because the training data remains with each local client and is not shared or exchanged with other clients, FL may reduce privacy and security risks (compared with methods that pool multiple data sources) and can also address data access and heterogeneity problems. Synthetic data is artificially generated data that has the same structure and statistical properties as the original data but contains none of the original records, thereby minimising disclosure risk. Using FL to produce synthetic data (which we refer to as "federated synthesis") has the potential to combine data from multiple clients without compromising privacy, allowing access to data that may otherwise be inaccessible in its raw format.
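As a rough illustration of the training pattern described above, the following sketch runs a few rounds of federated averaging (one common FL aggregation scheme, used here as an assumption and not taken from the reviewed papers) on a toy linear model. The client datasets, the helper names (`local_update`, `federated_average`, `make_client_data`, `train`), and all hyperparameters are hypothetical. Only model weights leave each client; the raw records stay local.

```python
import random


def local_update(weights, data, lr=0.1, epochs=5):
    """Run a few gradient-descent steps of least squares on one client's private data."""
    w, b = weights
    n = len(data)
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return (w, b)


def federated_average(client_weights, client_sizes):
    """Server step: average client weights, weighted by each client's record count."""
    total = sum(client_sizes)
    w = sum(cw[0] * n for cw, n in zip(client_weights, client_sizes)) / total
    b = sum(cw[1] * n for cw, n in zip(client_weights, client_sizes)) / total
    return (w, b)


def make_client_data(rng, n, lo, hi):
    """Hypothetical client data: y = 2x + 1 plus small noise; the x range differs per client."""
    xs = [rng.uniform(lo, hi) for _ in range(n)]
    return [(x, 2 * x + 1 + rng.gauss(0, 0.05)) for x in xs]


def train(rounds=200, seed=0):
    """Simulate three clients and one server; return the final global (slope, intercept)."""
    rng = random.Random(seed)
    # Each client holds a different slice of the input space (heterogeneous data).
    clients = [
        make_client_data(rng, 30, 0.0, 1.0),
        make_client_data(rng, 50, 0.5, 1.5),
        make_client_data(rng, 20, -1.0, 0.0),
    ]
    global_weights = (0.0, 0.0)
    for _ in range(rounds):
        # Clients train locally, starting from the current global model.
        updates = [local_update(global_weights, data) for data in clients]
        # The server only ever sees the weights, never the client records.
        global_weights = federated_average(updates, [len(d) for d in clients])
    return global_weights


if __name__ == "__main__":
    slope, intercept = train()
    print(f"global model: y = {slope:.3f} * x + {intercept:.3f}")
```

In federated synthesis the shared model would be a generative one (the review reports that Generative Adversarial Networks predominate), and synthetic records would then be sampled from the trained global model; the regression model here is only a stand-in to keep the sketch short.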

Objectives: The objective was to review current research and practices for using FL to generate synthetic data and determine the extent to which research has been undertaken, the methods and evaluation practices used, and any research gaps.

Methods: A scoping review was conducted to systematically map and describe the published literature on the use of FL to generate synthetic data. Relevant studies were identified through online databases, and the findings were described, grouped, and summarised. Information extracted included article characteristics, the type of data synthesised, the model architecture, and the methods (if any) used to evaluate utility and privacy risk.

Results: A total of 69 articles were included in the scoping review; all were published between 2018 and 2023, with two-thirds (46) published in 2022. Of the 69 articles, 30% (21) focussed on synthetic data generation as the main model output (6 of these generated tabular data), whereas 59% (41) focussed on data augmentation. All 21 articles performing federated synthesis used deep learning methods (predominantly Generative Adversarial Networks) to generate the synthetic data.

Conclusions: Federated synthesis is in its early days but shows promise as a method that can construct a global synthetic dataset without sharing any of the local client data. Because the field is in its infancy, there are areas still to explore, both in the privacy risk associated with the various methods proposed and, more generally, in how those risks are measured.

Keywords: data confidentiality; data utility; federated learning; review; synthetic data.

Conflict of interest statement

Statement on conflicts of interest: The authors declare that they have no conflicts to report.

Figures

**Figure 1: Structure of a typical GAN (Generative Adversarial Network)**

**Figure 2: PRISMA flow diagram for the literature database and website search**

**Figure 3: Characteristics of included papers (n = 69)**

**Figure 4: The type of data used by each paper, by the goal of the research**

**Figure 5: The different GAN (Generative Adversarial Network) configurations used on the server and client devices, for federated synthesis**

