Biometrics. 2023 Dec;79(4):3831-3845.
doi: 10.1111/biom.13852. Epub 2023 Apr 4.

A synthetic data integration framework to leverage external summary-level information from heterogeneous populations

Tian Gu et al. Biometrics. 2023 Dec.

Abstract

There is a growing need for flexible general frameworks that integrate individual-level data with external summary information for improved statistical inference. External information relevant for a risk prediction model may come in multiple forms, such as regression coefficient estimates or predicted values of the outcome variable. Different external models may use different sets of predictors, and the algorithms they use to predict the outcome Y given these predictors may or may not be known. The underlying populations corresponding to each external model may differ from each other and from the internal study population. Motivated by a prostate cancer risk prediction problem where novel biomarkers are measured only in the internal study, this paper proposes an imputation-based methodology, where the goal is to fit a target regression model with all available predictors in the internal study while utilizing summary information from external models that may have used only a subset of the predictors. The method allows for heterogeneity of covariate effects across the external populations. The proposed approach generates synthetic outcome data in each external population and then uses stacked multiple imputation to create a long dataset with complete covariate information. The final analysis of the stacked imputed data is conducted by weighted regression. This flexible and unified approach can improve statistical efficiency of the estimated coefficients in the internal study, improve predictions by utilizing even partial information available from models that use a subset of the full set of covariates used in the internal study, and provide statistical inference for the external population with potentially different covariate effects from the internal population.

Keywords: data integration; prediction models; stacked multiple imputation; synthetic data.
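
The workflow named in the abstract (synthetic outcomes, stacked multiple imputation, weighted regression) can be illustrated with a toy sketch. This is not the authors' implementation: the simulated data, the linear (rather than risk-model) outcome, the normal-linear imputation model, the equal 1/M weights, and the function name `syndi_sketch` are all assumptions made for illustration, and the sketch ignores heterogeneity between the internal and external populations.

```python
import numpy as np


def syndi_sketch(n_int=200, n_syn=2000, n_imp=10, seed=0):
    """Toy linear-outcome illustration; returns (intercept, x, b) estimates."""
    rng = np.random.default_rng(seed)

    # Internal study (simulated here): outcome y, shared predictor x, and a
    # biomarker b measured only internally. True model: y = 1 + 2x + b + noise.
    x = rng.normal(size=n_int)
    b = 0.5 * x + rng.normal(size=n_int)
    y = 1.0 + 2.0 * x + 1.0 * b + rng.normal(size=n_int)

    # Assumed external summary information: coefficients and residual SD of a
    # reduced model y ~ x (the values implied by the simulation above).
    ext_coef = np.array([1.0, 2.5])
    ext_sd = np.sqrt(2.0)

    # Step 1: build synthetic records by resampling internal x and drawing
    # synthetic outcomes from the external model; b is missing for these rows.
    x_syn = rng.choice(x, size=n_syn, replace=True)
    y_syn = ext_coef[0] + ext_coef[1] * x_syn + ext_sd * rng.normal(size=n_syn)

    # Step 2: fit a simple imputation model for b given (x, y) on internal data.
    z_int = np.column_stack([np.ones(n_int), x, y])
    imp_coef, *_ = np.linalg.lstsq(z_int, b, rcond=None)
    imp_sd = np.std(b - z_int @ imp_coef, ddof=3)

    # Step 3: impute b for the synthetic rows n_imp times and stack the
    # completed datasets (internal + synthetic) into one long dataset.
    z_syn = np.column_stack([np.ones(n_syn), x_syn, y_syn])
    designs, outcomes = [], []
    for _ in range(n_imp):
        b_syn = z_syn @ imp_coef + imp_sd * rng.normal(size=n_syn)
        x_all = np.concatenate([x, x_syn])
        b_all = np.concatenate([b, b_syn])
        designs.append(np.column_stack([np.ones(n_int + n_syn), x_all, b_all]))
        outcomes.append(np.concatenate([y, y_syn]))
    design = np.vstack(designs)
    outcome = np.concatenate(outcomes)

    # Step 4: weighted least squares on the stacked data, weight 1/n_imp per
    # row, solved by scaling rows with the square root of the weights.
    sqrt_w = np.sqrt(np.full(outcome.shape[0], 1.0 / n_imp))
    gamma, *_ = np.linalg.lstsq(design * sqrt_w[:, None], outcome * sqrt_w, rcond=None)
    return gamma
```

Calling `syndi_sketch()` returns the three fitted coefficients of the full model y ~ x + b. With equal weights the point estimates coincide with an unweighted fit on the stacked data; the weights are kept to mirror the weighted-regression step described in the abstract.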

Figures

Figure 1: Diagram of the four-step proposed synthetic data integration strategy (SynDI).
Figure 2: Simulation settings snapshot.
Figure 3: Visualization of simulation I results over increasing synthetic data size: (a) point estimates of γ; (b) variance estimators vs. the empirical variance of γ̂; (c) comparison of the proposed and the existing variance estimators of γ̂. This figure appears in color in the electronic version of this article, and any mention of color refers to that version.
Figure 4: Visualization of prediction metrics over increasing synthetic data size for simulation II. Larger AUC (area under the curve), smaller SSE (sum of squared errors), and smaller BS (Brier score) represent better prediction accuracy. This figure appears in color in the electronic version of this article, and any mention of color refers to that version.
