The impact of different sources of heterogeneity on loss of accuracy from genomic prediction models
- PMID: 30202918
- PMCID: PMC7868050
- DOI: 10.1093/biostatistics/kxy044
Abstract
Cross-study validation (CSV) of prediction models is an alternative to traditional cross-validation (CV) in domains where multiple comparable datasets are available. Although many studies have noted potential sources of heterogeneity in genomic studies, to our knowledge none have systematically investigated their intertwined impacts on prediction accuracy across studies. We employ a hybrid parametric/non-parametric bootstrap method to realistically simulate publicly available compendia of microarray, RNA-seq, and whole metagenome shotgun microbiome studies of health outcomes. Three types of heterogeneity between studies are manipulated and studied: (i) imbalances in the prevalence of clinical and pathological covariates, (ii) differences in gene covariance that could be caused by batch, platform, or tumor purity effects, and (iii) differences in the "true" model that associates gene expression and clinical factors with outcome. We assess model accuracy while altering these factors. Lower accuracy is seen in CSV than in CV. Surprisingly, heterogeneity in known clinical covariates and differences in gene covariance structure contribute very little to the loss of accuracy when validating in new studies. However, forcing identical generative models greatly reduces the within/across study difference. These results, observed consistently for multiple disease outcomes and omics platforms, suggest that the most easily identifiable sources of study heterogeneity are not necessarily the primary ones that undermine the ability to replicate the accuracy of omics prediction models in new studies. Unidentified heterogeneity, such as could arise from unmeasured confounding, may be more important.
Keywords: Cross-study validation; Data heterogeneity; Genomic prediction models.
© The Author 2018. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com.
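The contrast between CV and CSV described in the abstract can be illustrated with a small toy simulation. This is a minimal sketch under stated assumptions, not the hybrid parametric/non-parametric bootstrap used in the article: the outcome is assumed continuous and linear in the features, the model is ordinary least squares, and all names (`simulate_study`, `cv_mse`, `csv_mse`, `compare`, `model_shift`) are hypothetical. The `model_shift` parameter stands in for heterogeneity type (iii), a difference in the "true" generative model between two studies.

```python
import numpy as np

def simulate_study(rng, n, beta, noise_sd=1.0):
    """Draw one hypothetical study: features X and a continuous outcome y."""
    X = rng.normal(size=(n, beta.size))
    y = X @ beta + rng.normal(scale=noise_sd, size=n)
    return X, y

def fit_ols(X, y):
    """Ordinary least squares fit (an assumed stand-in for a prediction model)."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

def mse(X, y, coef):
    """Mean squared prediction error of a fitted coefficient vector."""
    return float(np.mean((y - X @ coef) ** 2))

def cv_mse(X, y, k=5):
    """Within-study error: k-fold cross-validation on a single study."""
    folds = np.array_split(np.arange(len(y)), k)
    errors = []
    for test_idx in folds:
        train = np.ones(len(y), dtype=bool)
        train[test_idx] = False
        coef = fit_ols(X[train], y[train])
        errors.append(mse(X[test_idx], y[test_idx], coef))
    return float(np.mean(errors))

def csv_mse(train_study, test_study):
    """Across-study error: train on one full study, validate on another."""
    coef = fit_ols(*train_study)
    X_test, y_test = test_study
    return mse(X_test, y_test, coef)

def compare(seed=0, n=200, p=10, model_shift=1.0):
    """Return (CV error, CSV error) for two studies whose true coefficient
    vectors differ by model_shift times a random perturbation."""
    rng = np.random.default_rng(seed)
    beta_a = rng.normal(size=p)
    beta_b = beta_a + model_shift * rng.normal(size=p)
    study_a = simulate_study(rng, n, beta_a)
    study_b = simulate_study(rng, n, beta_b)
    return cv_mse(*study_a), csv_mse(study_a, study_b)

if __name__ == "__main__":
    for shift in (0.0, 1.0):
        cv_err, csv_err = compare(model_shift=shift)
        print(f"model_shift={shift}: CV MSE={cv_err:.2f}, CSV MSE={csv_err:.2f}")
```

In this toy setting, setting `model_shift=0.0` forces identical generative models in both studies, so the gap between CV and CSV error should largely disappear, while a nonzero shift should widen it. This mirrors the qualitative pattern reported in the abstract but says nothing about its magnitude in real omics data.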