A synthetic data integration framework to leverage external summary-level information from heterogeneous populations
- PMID: 36876883
- PMCID: PMC10480346
- DOI: 10.1111/biom.13852
Abstract
There is a growing need for flexible general frameworks that integrate individual-level data with external summary information for improved statistical inference. External information relevant for a risk prediction model may come in multiple forms, such as regression coefficient estimates or predicted values of the outcome variable. Different external models may use different sets of predictors, and the algorithm they used to predict the outcome Y given these predictors may or may not be known. The underlying populations corresponding to each external model may be different from each other and from the internal study population. Motivated by a prostate cancer risk prediction problem where novel biomarkers are measured only in the internal study, this paper proposes an imputation-based methodology, where the goal is to fit a target regression model with all available predictors in the internal study while utilizing summary information from external models that may have used only a subset of the predictors. The method allows for heterogeneity of covariate effects across the external populations. The proposed approach generates synthetic outcome data in each external population and then uses stacked multiple imputation to create a long dataset with complete covariate information. The final analysis of the stacked imputed data is conducted by weighted regression. This flexible and unified approach can improve statistical efficiency of the estimated coefficients in the internal study, improve predictions by utilizing even partial information available from models that use a subset of the full set of covariates used in the internal study, and provide statistical inference for the external population with potentially different covariate effects from the internal population.
Keywords: data integration; prediction models; stacked multiple imputation; synthetic data.
© 2023 The Authors. Biometrics published by Wiley Periodicals LLC on behalf of International Biometric Society.
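The three steps named in the abstract (synthetic outcome generation, stacked multiple imputation, weighted regression) can be illustrated with a minimal sketch. This is not the authors' implementation, and it rests on several assumptions that do not come from the abstract: a single external logistic model that uses one shared covariate `x`, one internal-only biomarker `b`, synthetic covariates obtained by resampling the internal `x`, a normal linear imputation model for `b` given `x` and `y` with fixed (not posterior-drawn) parameters, and no heterogeneity terms or variance estimation. All function and variable names are hypothetical.

```python
import numpy as np


def fit_weighted_logistic(X, y, w, n_iter=50, ridge=1e-8):
    """Weighted logistic regression by Newton-Raphson.

    X must already contain an intercept column; w holds row weights.
    """
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        grad = X.T @ (w * (y - p))
        hess = (X * (w * p * (1.0 - p))[:, None]).T @ X
        hess += ridge * np.eye(X.shape[1])  # tiny ridge for numerical stability
        step = np.linalg.solve(hess, grad)
        beta += step
        if np.max(np.abs(step)) < 1e-8:
            break
    return beta


def synthetic_integration(x_int, b_int, y_int, ext_coef, n_syn, n_imp, rng):
    """Illustrative pipeline: synthetic outcomes, stacked imputation, weighted fit.

    ext_coef = (intercept, slope) of an assumed external logistic model y ~ x.
    Returns coefficients (intercept, x, b) of the target model y ~ x + b.
    """
    n = len(y_int)

    # Step 1: synthetic external records. Assumption: covariates are
    # resampled from the internal x; outcomes come from the external model.
    x_syn = rng.choice(x_int, size=n_syn, replace=True)
    p_syn = 1.0 / (1.0 + np.exp(-(ext_coef[0] + ext_coef[1] * x_syn)))
    y_syn = rng.binomial(1, p_syn)

    # Step 2: imputation model for the biomarker b given (x, y),
    # fitted on internal data only (normal linear model, assumed).
    Z_int = np.column_stack([np.ones(n), x_int, y_int])
    gamma, *_ = np.linalg.lstsq(Z_int, b_int, rcond=None)
    resid_sd = np.std(b_int - Z_int @ gamma, ddof=Z_int.shape[1])
    Z_syn = np.column_stack([np.ones(n_syn), x_syn, y_syn])

    # Step 3: stack n_imp completed copies of internal + synthetic data.
    x_all = np.concatenate([x_int, x_syn])
    y_all = np.concatenate([y_int, y_syn])
    blocks = []
    for _ in range(n_imp):
        b_syn = Z_syn @ gamma + rng.normal(0.0, resid_sd, size=n_syn)
        b_all = np.concatenate([b_int, b_syn])
        blocks.append(np.column_stack([x_all, b_all, y_all]))
    stacked = np.vstack(blocks)

    # Step 4: weighted regression; each stacked row gets weight 1 / n_imp.
    weights = np.full(len(stacked), 1.0 / n_imp)
    design = np.column_stack([np.ones(len(stacked)), stacked[:, 0], stacked[:, 1]])
    return fit_weighted_logistic(design, stacked[:, 2], weights)


# Small simulated demo (all numbers below are made up for illustration).
rng = np.random.default_rng(0)
x = rng.normal(size=500)
b = 0.5 * x + rng.normal(size=500)
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(-0.5 + 1.0 * x + 0.8 * b))))
beta = synthetic_integration(x, b, y, ext_coef=(-0.4, 1.2),
                             n_syn=1000, n_imp=10, rng=rng)
print(beta)
```

Because the imputation parameters are held fixed and no variance correction is applied, the sketch yields point estimates only; valid standard errors for stacked imputed data require an adjustment that is outside what this example covers.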