Review

Developing a New Method for Combining Data from Multiple Studies [Internet]

Nilanjan Chatterjee¹,²,³, Allison Meisner⁴, Parichoy Pal Choudhury⁵, Prosenjit Kundu¹, Kala Visvanathan³,⁶, Scott Zeger¹,²,³,⁷,⁸

Washington (DC): Patient-Centered Outcomes Research Institute (PCORI); 2022 Oct.

PCORI Final Research Reports.

Affiliations

¹ Department of Biostatistics, Bloomberg School of Public Health, The Johns Hopkins University, Baltimore, Maryland
² Department of Epidemiology, Bloomberg School of Public Health, The Johns Hopkins University, Baltimore, Maryland
³ Department of Oncology, School of Medicine, The Johns Hopkins University, Baltimore, Maryland
⁴ Fred Hutchinson Cancer Research Center, Seattle, Washington
⁵ Division of Cancer Epidemiology and Genetics, NCI Shady Grove, Bethesda, Maryland
⁶ Sidney Kimmel Comprehensive Cancer Center, The Johns Hopkins University, Baltimore, Maryland
⁷ Department of International Health, Bloomberg School of Public Health, The Johns Hopkins University, Baltimore, Maryland
⁸ Krieger School of Arts and Sciences, The Johns Hopkins University, Baltimore, Maryland

PMID: 40168497
Bookshelf ID: NBK613060
DOI: 10.25302/10.2022.ME.160234530

Excerpt

Background: Disease risk prediction models are critical for delivering the promise of precision medicine in developing risk-based individualized strategies for preventive interventions. As large-scale epidemiologic studies, such as genome-wide association studies (GWAS), lead to the discovery of new risk factors, there is an urgent need for robust statistical methods and software tools to aid the development, validation, and application of comprehensive models for predicting future risk of diseases at individual levels. We identified research gaps in 3 areas: (1) the development of flexible yet parsimonious models for a multivariate risk profile associated with many risk factors, (2) building of multivariate risk models through synthesis of data from multiple disparate data sources, and (3) model validation in prospective cohort studies that employ complex sampling for biomarker data collection.

Objective: Develop novel statistical methods to address current research gaps in the development, validation, and application of risk prediction models.

Methods: We developed efficient statistical methods for analyzing case–control studies to model gene-environment interactions (GEIs) using polygenic risk scores (PRSs), which measure the total genetic burden of a disease by combining many genetic variants. We propose a new framework for conducting generalized meta-analysis to develop comprehensive, multifactorial risk models by combining data across studies that may have disparate covariate information. We also propose a new method to evaluate the performance of risk models in validation cohorts that employ 2-phase study designs, which restrict the measurement of expensive covariates to efficiently selected subsamples. All methods were evaluated through realistic simulation studies for bias, efficiency, and sensitivity to the underlying modeling assumptions. Applications of the novel methods for modeling risks of breast and lung cancers are illustrated using data sets from epidemiologic cohort and case–control studies. In addition, the Individualized Coherent Absolute Risk Estimator (iCARE) R software package was updated to include a model validation component and was then used to develop and validate multiple models for predicting the absolute risk of breast cancer.

Results: When we regressed a disease-associated PRS on a set of environmental risk factors in a sample restricted to cases, linear regression resulted in a major gain in the efficiency of estimating the GEI compared with logistic regression. This analysis assumes that genetic and environmental risk factors are independent. Further, covariate adjustment using PRSs associated with certain endogenous exposures, such as body mass index, can account for possible bias resulting from gene-environment dependence. Generalized meta-analysis allows efficient estimation of parameters for underlying multifactorial risk models by combining information across studies, each of which can be limited individually because of small sample size and/or lack of complete information on relevant risk factors. The method is reasonably robust to the violation of various assumptions, and a diagnostic test is available to detect potential heterogeneity across studies. Further, the efficiency of model validation analysis in 2-phase cohort studies can be substantially improved by using incomplete data from participants not selected at phase 2 through partial risk-score information. Finally, we developed a new model for predicting absolute risk of breast cancer by combining information from multiple data sources, and we conducted extensive validation in external cohort studies. The results show that our methods and software packages provide a robust framework for developing models for estimating absolute risks, that is, the probability of future disease incidence over specified time intervals, by integrating information from various sources.
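The case-only idea in the Results can be illustrated with a small simulation. The following is a minimal sketch, not the authors' implementation: it assumes a logistic risk model with a standardized, normally distributed PRS, a single binary exposure that is independent of the PRS, and a rare disease. Under those assumptions, the slope from regressing the PRS on the exposure among cases approximates the interaction coefficient. The function name and all parameter values (`b0`, `b_g`, `b_e`, `b_ge`, the exposure prevalence of 0.3) are hypothetical choices made for illustration.

```python
import numpy as np


def simulate_case_only_gei(n_pop=2_000_000, b0=-5.0, b_g=0.4, b_e=0.3,
                           b_ge=0.2, seed=0):
    """Simulate a population and estimate the GEI from cases only.

    Assumed (hypothetical) risk model:
        logit P(D=1 | G, E) = b0 + b_g*G + b_e*E + b_ge*G*E
    where G is a standardized PRS and E is a binary exposure,
    generated independently of each other.
    Returns the case-only slope and the number of cases.
    """
    rng = np.random.default_rng(seed)
    g = rng.standard_normal(n_pop)                  # standardized PRS
    e = rng.binomial(1, 0.3, n_pop).astype(float)   # exposure, independent of g

    # Disease status drawn from the assumed logistic model.
    lin = b0 + b_g * g + b_e * e + b_ge * g * e
    p = 1.0 / (1.0 + np.exp(-lin))
    d = rng.random(n_pop) < p

    # Restrict to cases and fit ordinary least squares of G on E.
    g_cases, e_cases = g[d], e[d]
    design = np.column_stack([np.ones(e_cases.size), e_cases])
    coef, *_ = np.linalg.lstsq(design, g_cases, rcond=None)

    # coef[1] approximates b_ge when the disease is rare and G is
    # independent of E in the source population.
    return float(coef[1]), int(d.sum())


if __name__ == "__main__":
    slope, n_cases = simulate_case_only_gei()
    print(f"cases: {n_cases}, case-only slope: {slope:.3f} (true b_ge = 0.2)")
```

The slope recovers the interaction only approximately; if the PRS and the exposure are correlated in the source population, the estimate is biased, which is the situation the covariate adjustment described above is meant to address.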

Conclusions: A series of novel statistical methods was developed for building multifactorial risk models through powerful analysis of GEIs and data integration across multiple disparate studies. New methods were also developed for conducting efficient model validation analysis in epidemiologic studies that employ complex sampling designs. The new methods and related software tools will aid in the future delivery of risk tools for clinical applications.

Limitations: Further methodological development is needed to account for population stratification in case–control studies of GEIs, to extend generalized meta-analysis to combine studies that employ different sampling designs (eg, case–control vs cohort), and to evaluate alternative metrics for model performance in validation studies with complex sampling designs.

Grants and funding

Original Project Title: Statistical Methods for Development, Validation, and Implementation of Absolute Risk Models
