Developing a New Method for Combining Data from Multiple Studies [Internet]
- PMID: 40168497
- Bookshelf ID: NBK613060
- DOI: 10.25302/10.2022.ME.160234530
Excerpt
Background: Disease risk prediction models are critical for delivering the promise of precision medicine in developing risk-based individualized strategies for preventive interventions. As large-scale epidemiologic studies, such as genome-wide association studies (GWAS), lead to the discovery of new risk factors, there is an urgent need for robust statistical methods and software tools to aid the development, validation, and application of comprehensive models for predicting future risk of diseases at individual levels. We identified research gaps in 3 areas: (1) the development of flexible yet parsimonious models for a multivariate risk profile associated with many risk factors, (2) building of multivariate risk models through synthesis of data from multiple disparate data sources, and (3) model validation in prospective cohort studies that employ complex sampling for biomarker data collection.
Objective: Develop novel statistical methods to address current research gaps in the development, validation, and application of risk prediction models.
Methods: We developed efficient statistical methods for analyzing case–control studies to model gene-environment interactions (GEIs) using polygenic risk scores (PRSs), which measure the total genetic burden of a disease by combining many genetic variants. In this report, we propose a new framework for conducting generalized meta-analysis to develop comprehensive, multifactorial risk models by combining data across studies that may have disparate covariate information. Further, we propose a new method to evaluate the performance of risk models in validation cohorts that employ 2-phase study designs, which restrict the measurement of expensive covariates to efficiently selected subsamples. All methods were evaluated through realistic simulation studies for bias, efficiency, and sensitivity to the underlying modeling assumptions. Applications of the novel methods for modeling risks of breast and lung cancers are illustrated using data sets from epidemiologic cohort and case–control studies. In addition, the Individualized Coherent Absolute Risk Estimator (iCARE) R software package was updated to include a model validation component and was then used to develop and validate multiple models for predicting the absolute risk of breast cancer.
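To make the PRS concept concrete, the sketch below computes a score as a weighted sum of risk-allele counts, a common construction that this abstract does not spell out. The function name `polygenic_risk_score`, the genotype coding (0, 1, or 2 copies of the risk allele), and the example weights are all illustrative assumptions, not values or code from the report or from iCARE.

```python
import numpy as np

def polygenic_risk_score(genotypes, log_odds_ratios):
    """Weighted sum of risk-allele counts (hypothetical helper).

    genotypes: array of shape (n_individuals, n_variants), entries 0, 1, or 2.
    log_odds_ratios: array of shape (n_variants,), one weight per variant.
    """
    genotypes = np.asarray(genotypes, dtype=float)
    weights = np.asarray(log_odds_ratios, dtype=float)
    if genotypes.ndim != 2 or genotypes.shape[1] != weights.shape[0]:
        raise ValueError("genotypes must be (n_individuals, n_variants) "
                         "and match the number of weights")
    # Each individual's score is the dot product of allele counts and weights.
    return genotypes @ weights

# Made-up example: two individuals, three variants.
example_scores = polygenic_risk_score(
    [[0, 1, 2],
     [2, 0, 1]],
    [0.1, 0.2, -0.05],
)
print(example_scores)
```

In practice the weights would typically come from published GWAS effect estimates, and the score is often standardized before modeling.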
Results: When we regressed a disease-associated PRS on a set of environmental risk factors in a sample restricted to cases, linear regression resulted in a major gain in the efficiency of estimating the GEI compared with logistic regression. The analysis assumed independence of gene and environmental risk factors. Further, covariate adjustment using PRSs associated with certain endogenous exposures, such as body mass index, can account for possible bias resulting from gene-environment dependence. Generalized meta-analysis allows efficient estimation of parameters for underlying multifactorial risk models by combining information across studies, each of which may be limited by small sample size and/or incomplete information on relevant risk factors. The method is reasonably robust to the violation of various assumptions, and a diagnostic test is available to detect potential heterogeneity across studies. Further, the efficiency of model validation analysis in 2-phase cohort studies can be substantially improved by using incomplete data, in the form of partial risk-score information, from participants not selected at phase 2. Finally, we developed a new model for predicting absolute risk of breast cancer by combining information from multiple data sources and validated it extensively in external cohort studies. The results show that our methods and software packages provide a robust framework for developing models for estimating absolute risks, that is, the probability of future disease incidence over specified time intervals, by integrating information from various sources.
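The case-only idea in the first result can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the report's implementation: it assumes a logistic disease model with a PRS-by-environment interaction, a rare disease, a PRS standardized to unit variance, and independence of the PRS and the environmental factor in the population. Under those assumptions, the slope from a linear regression of the PRS on the environmental factor among cases approximates the interaction coefficient. All parameter values and function names are hypothetical.

```python
import numpy as np

def simulate_cases(n, beta_g, beta_e, beta_ge, intercept, rng):
    """Simulate a population and return PRS and exposure for cases only."""
    prs = rng.standard_normal(n)                      # standardized PRS
    env = rng.binomial(1, 0.5, size=n).astype(float)  # binary exposure, independent of PRS
    linear_predictor = (intercept + beta_g * prs + beta_e * env
                        + beta_ge * prs * env)
    prob = 1.0 / (1.0 + np.exp(-linear_predictor))    # logistic disease model
    is_case = rng.random(n) < prob
    return prs[is_case], env[is_case]

def case_only_interaction(prs_cases, env_cases):
    """Slope from least-squares regression of PRS on exposure among cases."""
    design = np.column_stack([np.ones_like(env_cases), env_cases])
    coef, *_ = np.linalg.lstsq(design, prs_cases, rcond=None)
    return coef[1]

rng = np.random.default_rng(2022)
true_interaction = 0.2  # assumed value, for illustration only
prs_cases, env_cases = simulate_cases(
    n=1_000_000, beta_g=0.5, beta_e=0.3, beta_ge=true_interaction,
    intercept=-5.0, rng=rng,
)
estimate = case_only_interaction(prs_cases, env_cases)
print(f"cases: {prs_cases.size}, case-only interaction estimate: {estimate:.3f}")
```

The sketch shows only the estimator itself; it does not reproduce the efficiency comparison with logistic regression or the adjustment for gene-environment dependence described above.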
Conclusions: A series of novel statistical methods were developed for building multifactorial risk models through powerful analysis of GEIs and data integration across multiple disparate studies. New methods were also developed for conducting efficient model validation analysis in epidemiologic studies that employ complex sampling designs. The new methods and related software tools will aid in the future delivery of risk tools for clinical applications.
Limitations: Further development of methods is needed to account for population stratification in case–control studies of GEIs, to extend generalized meta-analysis to combine studies that employ different sampling designs (eg, case–control vs cohort), and to evaluate alternative metrics for model performance in validation studies with complex sampling designs.
Copyright © 2022. Johns Hopkins University. All Rights Reserved.
