Popul Health Metr. 2023 Oct 31;21(1):19.
doi: 10.1186/s12963-023-00319-5.

Constructing synthetic populations in the age of big data

Mioara A Nicolaie et al. Popul Health Metr. 2023.

Abstract

Background: To develop public health intervention models using micro-simulations, extensive personal information about inhabitants is needed, such as socio-demographic, economic and health figures. Confidentiality is an essential characteristic of such data, while the data should reflect realistic scenarios. Collection of such data is possible only in secured environments and not directly available for open-source micro-simulation models. The aim of this paper is to illustrate a method of construction of synthetic data by predicting individual features through models based on confidential data on health and socio-economic determinants of the entire Dutch population.

Methods: Administrative records and health registry data were linked to socio-economic characteristics and self-reported lifestyle factors. For the entire Dutch population (n = 16,778,708), all information except lifestyle factors was available. Lifestyle factors were available from the 2012 Dutch Health Monitor (n = 370,835). Regression models were used to predict individual features sequentially.

Results: The synthetic population resembles the original confidential population. Features predicted in the first stages of the sequential procedure closely match those in the original population, while features predicted in later stages accumulate the limitations arising from data quality and from previously modelled features.
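One way to check such resemblance is to compare stratum means and their 95% confidence intervals in the synthetic and original data, as the figure captions describe for BMI. The sketch below is an assumption-laden illustration, not the authors' procedure: it uses a normal-approximation interval (mean ± 1.96 standard errors), toy data, and hypothetical names such as `mean_ci_by_stratum`.

```python
import math
import random
import statistics
from collections import defaultdict


def mean_ci_by_stratum(rows, value_key, strata_keys, z=1.96):
    """Return {stratum: (mean, lower, upper)} using a normal-approximation CI."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[k] for k in strata_keys)].append(row[value_key])
    result = {}
    for stratum, values in groups.items():
        if len(values) < 2:
            continue  # a standard error needs at least two observations
        mean = statistics.fmean(values)
        se = statistics.stdev(values) / math.sqrt(len(values))
        result[stratum] = (mean, mean - z * se, mean + z * se)
    return result


def make_toy(n, rng):
    """Toy data generator (assumed values, for illustration only)."""
    rows = []
    for _ in range(n):
        age = rng.randint(1, 8)
        smoker = int(rng.random() < 0.3)
        bmi = rng.gauss(22 + 0.5 * age + 0.8 * smoker, 3)
        rows.append({"age_class": age, "smoker": smoker, "bmi": bmi})
    return rows


rng = random.Random(1)
original = make_toy(3000, rng)   # stands in for the confidential survey
synthetic = make_toy(3000, rng)  # stands in for the synthetic population

orig_ci = mean_ci_by_stratum(original, "bmi", ("age_class", "smoker"))
syn_ci = mean_ci_by_stratum(synthetic, "bmi", ("age_class", "smoker"))

# Print the two intervals side by side for each age class and smoking status.
for stratum in sorted(orig_ci):
    if stratum in syn_ci:
        o, s = orig_ci[stratum], syn_ci[stratum]
        print(stratum, f"original {o[0]:.2f} [{o[1]:.2f}, {o[2]:.2f}]",
              f"synthetic {s[0]:.2f} [{s[1]:.2f}, {s[2]:.2f}]")
```

Overlapping intervals within a stratum suggest agreement on that feature; systematic gaps in particular strata would point to where the sequential models lose fidelity.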

Conclusions: By combining socio-demographic, economic, health and lifestyle-related data at the individual level on a large scale, our method provides a powerful tool for constructing a good-quality synthetic population without confidentiality issues.

Keywords: Disclosure risk; Synthetic population.

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

Fig. 1
Estimated BMI mean and corresponding 95% CI for men in each 10-year age class (1 = < 20, 2 = 20–29, 3 = 30–39, 4 = 40–49, 5 = 50–59, 6 = 60–69, 7 = 70–79, 8 = 80+) by smoking status and educational level in the synthetic population (blue) and in the confidential original data from the DPHM survey (green)
Fig. 2
Estimated BMI mean and corresponding 95% CI for women in each 10-year age class (1 = < 20, 2 = 20–29, 3 = 30–39, 4 = 40–49, 5 = 50–59, 6 = 60–69, 7 = 70–79, 8 = 80+) by smoking status and educational level in the synthetic population (blue) and in the confidential original data from the DPHM survey (green)
Fig. 3
Prevalence of being sufficiently physically active and corresponding 95% CI for men in each 10-year age class (1 = < 20, 2 = 20–29, 3 = 30–39, 4 = 40–49, 5 = 50–59, 6 = 60–69, 7 = 70–79, 8 = 80+) by smoking status and educational level in the synthetic population (blue) and in the confidential original data from the DPHM survey (green)
Fig. 4
Prevalence of being sufficiently physically active and corresponding 95% CI for women in each 10-year age class (1 = < 20, 2 = 20–29, 3 = 30–39, 4 = 40–49, 5 = 50–59, 6 = 60–69, 7 = 70–79, 8 = 80+) by smoking status and educational level in the synthetic population (blue) and in the confidential original data from the DPHM survey (green)
Fig. 5
Estimated mean of lung cancer presence and corresponding 95% CI for men stratified by education level and across 8 age classes (1 = < 20, 2 = 20–29, 3 = 30–39, 4 = 40–49, 5 = 50–59, 6 = 60–69, 7 = 70–79, 8 = 80+) in the synthetic population (blue) and in the confidential original data of the DPHM survey linked to the cancer registry (green)
Fig. 6
Estimated mean of lung cancer presence and corresponding 95% CI for women stratified by education level and across 8 age classes (1 = < 20, 2 = 20–29, 3 = 30–39, 4 = 40–49, 5 = 50–59, 6 = 60–69, 7 = 70–79, 8 = 80+) in the synthetic population (blue) and in the confidential original data of the DPHM survey linked to the cancer registry (green)
Fig. 7
Percentage of smoker categories across 8 age classes (1 = < 20, 2 = 20–29, 3 = 30–39, 4 = 40–49, 5 = 50–59, 6 = 60–69, 7 = 70–79, 8 = 80+) and stratified by education level for men in the synthetic population (blue) and in the original confidential data of the DPHM survey (green)
Fig. 8
Percentage of smoker categories across 8 age classes (1 = < 20, 2 = 20–29, 3 = 30–39, 4 = 40–49, 5 = 50–59, 6 = 60–69, 7 = 70–79, 8 = 80+) and stratified by education level for women in the synthetic population (blue) and in the original confidential data of the DPHM survey (green)
